CN116612382A - Urban remote sensing image target detection method and device - Google Patents


Info

Publication number
CN116612382A
Authority
CN
China
Prior art keywords
remote sensing
sensing image
bounding box
obtaining
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310405274.8A
Other languages
Chinese (zh)
Inventor
蓝金辉
张铖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology Beijing USTB
Original Assignee
University of Science and Technology Beijing USTB
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology Beijing USTB
Priority to CN202310405274.8A
Publication of CN116612382A
Legal status: Pending

Classifications

    • G06V 20/176 Urban or other man-made structures
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06V 10/20 Image preprocessing
    • G06V 10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 10/764 Recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/766 Recognition or understanding using pattern recognition or machine learning, using regression, e.g. by projecting features on hyperplanes
    • G06V 10/82 Recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06T 2207/10032 Satellite or aerial image; Remote sensing
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06V 2201/07 Target detection
    • Y02A 30/60 Planning or developing urban green infrastructure

Abstract

The application discloses a method and a device for detecting targets in urban remote sensing images, comprising the following steps: obtaining an urban remote sensing image, and preprocessing the image to obtain sub-images; inputting the sub-images into a mixed attention backbone network for feature extraction to obtain a feature map; constructing a dual detection network, and processing the feature map to obtain a predicted rotated bounding box; obtaining the deviation between the predicted rotated bounding box and the ground truth using a smooth-z loss function, and obtaining a new bounding box by iteratively optimizing the loss value; retaining the optimal bounding box, and outputting the final detection result. For satellite or airborne urban remote sensing images, the method and device can accurately detect urban remote sensing targets oriented in different directions under a top-down view, and they have general applicability.

Description

Urban remote sensing image target detection method and device
Technical Field
The application relates to the technical field of target detection in computer vision, and in particular to a method and a device for detecting targets in urban remote sensing images.
Background
At present, detecting remote sensing targets oriented in different directions remains a difficult problem in the field. Existing remote sensing target detection techniques fall mainly into methods based on traditional machine learning and methods based on deep learning. Traditional machine learning methods search a given image with sliding windows and classify the contents, and they typically require hand-designed features. Deep learning methods generally comprise four steps: image feature extraction, image feature fusion, object classification and regression, and back-propagation.
For traditional remote sensing target detection tasks, most algorithms obtain candidate regions with a sliding-window method and then classify and identify targets of interest within those regions. This approach requires features to be designed by hand in advance, and the designed features sometimes fail to capture the image information effectively; moreover, remote sensing images contain targets at different scales and in different angular orientations, which further hinders the application of traditional detection methods to remote sensing imagery. With the continuous development of deep learning, more and more deep learning algorithms have been applied to the remote sensing field. Deep learning's big-data transfer-learning approach has further advanced remote sensing information extraction: low-level traditional features of remote sensing images, such as texture and shape, are exploited more fully, and extracting semantic features makes the classification and identification of remote sensing targets faster and more accurate, greatly improving detection accuracy. However, most recent remote sensing target detection algorithms still use generic horizontal-box methods. Although these cope to some extent with the multi-scale, multi-target nature of remote sensing images, they handle the multi-angle nature of remote sensing targets poorly, in particular the background-redundancy interference introduced around oriented objects and the resulting inaccurate bounding box localization. When objects are densely arranged, the shortcomings of generic horizontal boxes cause many bounding boxes to overlap heavily and introduce useless information such as excessive background, greatly degrading detection performance on remote sensing images. Research on rotated-box target detection algorithms is therefore an important direction in the remote sensing field.
Disclosure of Invention
The application provides a method and a device for detecting targets in urban remote sensing images, which are used to solve the technical problem that prior-art methods struggle to achieve good detection results on urban remote sensing targets whose orientations vary.
In order to solve the above technical problems, the application provides the following technical solutions:
in a first aspect, an embodiment of the present application provides a method for detecting targets in urban remote sensing images, including:
obtaining an urban remote sensing image, and preprocessing the image to obtain sub-images;
inputting the sub-images into a mixed attention backbone network for feature extraction to obtain a feature map;
constructing a dual detection network, and processing the feature map to obtain a predicted rotated bounding box;
obtaining the deviation between the predicted rotated bounding box and the ground truth using a smooth-z loss function, and obtaining a new bounding box by iteratively optimizing the loss value;
and retaining the optimal bounding box, and outputting the final detection result.
Further, the urban remote sensing image is a visible light image captured by a satellite-borne or airborne sensor;
the preprocessing cuts the original picture into several small pictures that are input into the mixed attention backbone network, and the sub-image prediction results are stitched and post-processed back into the full picture.
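As a minimal sketch of this tiling step (tile size, overlap, and the function name are illustrative assumptions, not taken from the patent), a large scene can be cut into overlapping sub-images whose detections are later shifted back into full-image coordinates by the recorded offsets:

```python
import numpy as np

def tile_image(image: np.ndarray, tile: int = 1024, overlap: int = 200):
    """Cut a large remote sensing scene into overlapping sub-images.

    Returns (sub_image, (x_offset, y_offset)) pairs; the offsets let
    sub-image detections be shifted back into full-image coordinates
    when the predictions are stitched together.
    """
    h, w = image.shape[:2]
    stride = tile - overlap
    tiles = []
    for y in range(0, max(h - overlap, 1), stride):
        for x in range(0, max(w - overlap, 1), stride):
            tiles.append((image[y:y + tile, x:x + tile], (x, y)))
    return tiles
```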
Further, the feature extraction includes:
the urban remote sensing image I_a is input into a deep convolutional neural network model, features are extracted from the global and local information of targets in the image through mixed self-attention, and an information-integrated feature map I_b is finally output.
Further, a detection decoupling network in the dual detection network is constructed, and the category information and the position-angle information of the target are predicted separately through split classification and regression operations, including:
performing classification on the input feature map I_b to obtain the category information C of the target, and performing a regression operation to obtain the position and angle information (x, y, w, h, θ) of the target, where x and y are respectively the abscissa and ordinate of the bounding box center, h and w are respectively the length and width of the bounding box, and θ is the rotation angle of the bounding box.
Further, constructing an angle correction network in the dual detection network and obtaining corrected angle information of the target through an angle regression operation, yielding a corrected predicted rotated bounding box, includes:
the feature map I_b is likewise input into the angle correction network, and a regression operation yields the corrected angle information θ′; the L1 norm of the difference between θ and θ′ gives the deviation Δθ; if Δθ is larger than a preset threshold x, θ′ is assigned to θ to correct the rotation angle, otherwise θ is kept unchanged; the obtained position and angle information is fused, and the predicted rotated bounding box is finally output.
Further, a smooth-z loss function is provided to obtain the deviation between the predicted rotated bounding box and the ground truth, and a new bounding box is obtained by iteratively optimizing the loss value, including:
based on the obtained predicted rotated bounding box, an initial loss value between it and the ground truth is computed with the loss function; feature points of the urban remote sensing image I_a are then re-extracted, and the process is iterated a preset number of times to obtain the loss value set {L_a^(1), L_a^(2), …, L_a^(N)} of I_a under real-label supervision; the minimum element of this set is selected as the loss value L_a to update the position-angle information, where L_a^(n) denotes the loss value obtained at the n-th iteration and N denotes the preset number of iterations. The loss function finally converges, yielding a new bounding box.
Further, retaining the optimal bounding box and outputting the final detection result includes:
regenerating a new bounding box list from all the obtained bounding boxes, and then ranking the bounding boxes by formula-based calculation to obtain the coordinates and confidence score of the optimal bounding box.
In a second aspect, an embodiment of the present application further provides an urban remote sensing image target detection device for implementing the method of any embodiment of the first aspect of the present application, including: an acquisition module, a detection module and a selection module.
The acquisition module is used for acquiring an urban remote sensing image and preprocessing the image to obtain sub-images.
The detection module comprises a mixed attention unit, a dual detection network unit and an optimization unit. The mixed attention unit is used for extracting features from the image to obtain a feature map. The dual detection network unit processes the feature map to obtain a predicted rotated bounding box. The optimization unit obtains the deviation between the predicted rotated bounding box and the ground truth through the smooth-z loss function, and obtains a new bounding box by iteratively optimizing the loss value.
The selection module is used for retaining the optimal bounding box and outputting the final detection result.
The embodiments of the present application also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method according to any of the embodiments of the first aspect of the present application.
The embodiment of the application also provides an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method according to any embodiment of the first aspect of the application when executing the computer program.
The technical solution provided by the application has at least the following beneficial effects:
According to the urban remote sensing image target detection method and device, an image is first preprocessed into sub-images; features of the targets are extracted through the mixed attention backbone network; the feature map then undergoes classification, regression and angle correction through the dual detection network; the deviation between the predicted value and the ground truth is optimized through the smooth-z loss function; and finally the bounding boxes are screened for the optimal one, achieving accurate target detection in urban remote sensing images. The detection method provided by the application can accurately detect rotated targets in urban remote sensing imagery.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments are briefly introduced below. It is apparent that the drawings described below cover only some embodiments of the present application, and other drawings may be obtained from these drawings without inventive effort by a person skilled in the art.
FIG. 1 is a schematic flow chart of a method for detecting an urban remote sensing image target according to an embodiment of the application;
FIG. 2 is a schematic diagram of a mixed attention network for obtaining feature maps according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a dual detection network for obtaining a prediction rotation bounding box according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of optimizing and updating a new bounding box with the smooth-z loss function according to an embodiment of the present application;
FIG. 5 is a flowchart of an optimal bounding box screening method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of an embodiment of the urban remote sensing image target detection device of the application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
The embodiment of the application provides an urban remote sensing image target detection method, as shown in fig. 1, comprising the following steps:
Step 110, acquiring an urban remote sensing image, and preprocessing the image to obtain sub-images;
Step 120, inputting the sub-images into the mixed attention backbone network for feature extraction to obtain a feature map;
Step 130, constructing a dual detection network, and processing the feature map to obtain a predicted rotated bounding box;
Step 140, obtaining the deviation between the predicted rotated bounding box and the ground truth using the smooth-z loss function, and obtaining a new bounding box by iteratively optimizing the loss value;
Step 150, retaining the optimal bounding box and outputting the final detection result.
Aiming at the problem of remote sensing image target detection, this embodiment provides a novel urban remote sensing image target detection method that can be implemented by an electronic device. The method inputs urban remote sensing images into a computer, extracts image features based on mixed attention, performs position-angle regression and angle correction on the feature maps with the dual detection network, computes the rotated-box deviation with the smooth-z loss function, screens bounding boxes with the optimal bounding box screening method, and outputs the final rotated detection box.
Further, as shown in fig. 2, implementing step 120, in which the sub-images are input into the mixed attention backbone network for feature extraction to obtain a feature map, specifically includes:
extracting features from the input urban remote sensing sub-images using a mixed attention mechanism.
The main process of the mixed attention mechanism is as follows: the input feature map is compressed by a convolution operation; the compressed feature map is fed into a multi-head self-attention module to extract spatial features of the key regions of the image; the feature map is then expanded back to its original size by an up-sampling operation and spliced with the input feature map; and the fused feature map undergoes this operation again to obtain local features. The self-attention formula is:
Attention(Q, K, V) = softmax(QK^T / √d_k) · V
where Q denotes the query vectors, K denotes the key vectors, V denotes the value (weight) vectors, d_k denotes the dimension of the key vectors k_i, and T denotes the matrix transpose.
Simultaneously, maximum pooling and average pooling are applied to the feature map; the pooled feature maps are combined along the channel dimension; and the combined map is then convolved and activated to obtain the global features of the image. Finally, the two new feature maps are spliced into one feature map, completing the mixed attention process.
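A rough PyTorch sketch of one such mixed attention block follows (the module layout, kernel sizes, and names are assumptions for illustration; the patent publishes no code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedAttention(nn.Module):
    """Illustrative mixed attention block: a compressed self-attention branch
    for local key-region features plus a pooling-based global branch."""

    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        # local branch: compress spatially, then multi-head self-attention
        # (channels must be divisible by heads)
        self.compress = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        # global branch: channel-wise max/avg pooled maps -> conv -> activation
        self.global_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        # splice the concatenated local map (2C) and global map (C) back to C
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        z = self.compress(x)                               # feature compression
        hz, wz = z.shape[2:]
        seq = z.flatten(2).transpose(1, 2)                 # (B, H'W', C) tokens
        seq, _ = self.attn(seq, seq, seq)                  # spatial self-attention
        z = seq.transpose(1, 2).reshape(b, c, hz, wz)
        z = F.interpolate(z, size=(h, w), mode="bilinear", align_corners=False)
        local = torch.cat([z, x], dim=1)                   # splice with input map
        mx, _ = x.max(dim=1, keepdim=True)                 # maximum pooling
        av = x.mean(dim=1, keepdim=True)                   # average pooling
        g = torch.sigmoid(self.global_conv(torch.cat([mx, av], dim=1)))
        glob = x * g                                       # re-weighted global features
        return self.fuse(torch.cat([local, glob], dim=1))  # one fused feature map
```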
Further, as shown in fig. 3, implementing step 130, in which a dual detection network is constructed and the feature map is processed to obtain a predicted rotated bounding box, specifically includes:
constructing the detection decoupling network in the dual detection network, and predicting the category information and the position-angle information of the target separately through split classification and regression operations.
The feature map is classified as follows: by treating the label class of a target as a discrete value, target categorization can be treated as a classification problem. The detection head of the network uses a classifier whose number of outputs matches the number of target categories to predict, each output corresponding to the prediction score that a positive sample belongs to a given category. Suppose the urban remote sensing training set is {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}, where x_i is an output feature vector of the detection head, y_i is the ground-truth value of the sample (a preset true label category), and n is the number of training samples. After forward propagation through the network, the output of the classifier can be expressed as
h_w(x_i) = [exp(w_1^T x_i), …, exp(w_k^T x_i)]^T / Σ_{j=1}^{k} exp(w_j^T x_i)
where T denotes the matrix transpose and w_i is the weight parameter connecting the detection-head neuron to the i-th output neuron of the softmax classifier; h_w(x_i) is a probability vector whose entries sum to 1, each entry giving the probability that the sample belongs to the corresponding category, and the category with the highest probability is taken as the classification result.
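As a toy illustration of this softmax mapping (not the patent's code; the names follow the notation above):

```python
import numpy as np

def softmax_probs(x: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Class probabilities h_w(x) for one detection-head feature vector x.

    W holds one weight column per category; the output entries sum to 1
    and the argmax gives the predicted category."""
    logits = W.T @ x            # w_j^T x for each category j
    logits -= logits.max()      # subtract the max for numerical stability
    e = np.exp(logits)
    return e / e.sum()
```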
The feature map is regressed as follows: the translation, size, and angle changes of the target box are first modeled; least-squares linear regression with L2-norm regularization is then applied to avoid overfitting high-dimensional features; and the finished target box is finally output. The purpose of bounding box regression is to learn a mapping from a region candidate box (Region Proposal box) to the correct label box (ground truth). Let P = (P_x, P_y, P_w, P_h, P_θ), where P denotes the region candidate box and x, y, w, h, θ denote, respectively, the center coordinates of the rectangular box in the image, its length and width, and its rotation angle. Let G = (G_x, G_y, G_w, G_h, G_θ), where G denotes the correct label box. Five learnable functions S_x(P), S_y(P), S_w(P), S_h(P), S_θ(P) convert the region candidate box P toward the correct label box G; because of errors, the converted bounding box G′ does not normally coincide exactly with G. The transformation from P to G′ comprises a translation, a scale change, and an angle change of the bounding box. S_x(P) and S_y(P) correspond to the following bounding box translation:
G′_x = P_w · S_x(P) + P_x,  G′_y = P_h · S_y(P) + P_y;
S_w(P) and S_h(P) correspond to the following bounding box scaling:
G′_w = P_w · exp(S_w(P)),  G′_h = P_h · exp(S_h(P));
S_θ(P) corresponds to the following angle transformation:
G′_θ = P_θ + S_θ(P) + kπ.
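These transforms can be read as a small decode function (a sketch under the parameterization above; the kπ periodicity of the angle is left implicit):

```python
import math

def decode_rotated_box(P, S):
    """Apply the learned transforms S = (Sx, Sy, Sw, Sh, Stheta) to a region
    candidate box P = (Px, Py, Pw, Ph, Ptheta), yielding the adjusted box G'."""
    px, py, pw, ph, pt = P
    sx, sy, sw, sh, st = S
    gx = pw * sx + px          # translation, scaled by the box size
    gy = ph * sy + py
    gw = pw * math.exp(sw)     # log-space width/height change
    gh = ph * math.exp(sh)
    gt = pt + st               # angle offset (plus k*pi periodicity)
    return gx, gy, gw, gh, gt
```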
The angle correction network in the dual detection network is constructed, and the corrected angle information of the target is obtained through an angle regression operation, giving the corrected predicted rotated bounding box.
Specifically: the feature map I_b is likewise regressed in the correction network to obtain the corrected angle information θ′; the deviations of θ and θ′ from the ground truth are obtained via the L1 norm; if the deviation of θ′ is smaller, θ′ is assigned to θ to correct the rotation angle, otherwise θ is kept unchanged. The angle-correction transformation can be expressed as Δθ = min((P_θ′ − G_θ), (P_θ − G_θ)), and Δθ replaces S_θ(P) to obtain the corrected angle G′_θ.
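A one-line sketch of this selection rule, in its training-time form where the supervising angle G_θ is available (names are illustrative):

```python
def correct_angle(theta: float, theta_prime: float, g_theta: float) -> float:
    """Keep whichever predicted angle lies closer (L1 distance) to the
    supervising angle g_theta, mirroring the angle-correction transform."""
    if abs(theta_prime - g_theta) < abs(theta - g_theta):
        return theta_prime   # correction branch wins: overwrite theta
    return theta
```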
Further, as shown in fig. 4, implementing step 140, in which the deviation between the predicted rotated bounding box and the ground truth is obtained with the smooth-z loss function and a new bounding box is obtained by iteratively optimizing the loss value, specifically includes the following.
The design process of the smooth-z loss function is as follows:
when the rotating frame boundary is calculated in the training process, the correlation of the universal detection head and the horizontal boundary frame is weak in the classifying and regression process of the universal detection head, so that the classification score and the regression positioning cannot be effectively correlated together, and the obtained result is not reliable enough. Therefore, the association degree of the matching degree and the measurement is adopted in the anchor frame allocation, the regression loss is promoted to further converge, and the regression parameters are as follows:
wherein x, y, w, h and θ respectively represent the center coordinates, width, height and angle of the real frame; x is x a ,y a ,w a ,h a ,θ a Respectively representing the center coordinates, width, height and angle of the anchor frame; x ', y ', w ', h ', θ ' represent the center coordinates, width, height and angle of the prediction bounding box, respectively; l (L) x Representing the x deviation of the anchor frame from the true value, l y ,l w ,l y Lθ is the same; l's' x Representing the x-deviation of the prediction bounding box from the anchor box, l' y ,l′ w ,l′ h ,l′ θ And the same is true.
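Under this parameterization, both offset sets can come from the same helper; the following sketch (plain Python, names assumed, using the standard rotated-anchor normalization) computes l from the ground-truth box and l′ from the predicted box relative to an anchor:

```python
import math

def regression_targets(box, anchor):
    """Normalized offsets (lx, ly, lw, lh, ltheta) of a rotated box
    (x, y, w, h, theta) relative to an anchor (xa, ya, wa, ha, ta)."""
    x, y, w, h, t = box
    xa, ya, wa, ha, ta = anchor
    return ((x - xa) / wa, (y - ya) / ha,
            math.log(w / wa), math.log(h / ha), t - ta)

# regression_targets(gt_box, anchor) gives l;
# regression_targets(pred_box, anchor) gives l'.
```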
On this basis, the iterative optimization in this embodiment is executed as follows:
(1) compute the state and activation value of each layer up to the last layer;
(2) compute the error of each layer, proceeding backward from the last layer;
(3) compute the gradient of each neuron connection weight;
(4) update the parameters according to the gradient descent rule.
The above steps are iterated until the stopping criterion is met.
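A schematic PyTorch-style loop tying these four steps to the min-loss selection described earlier (all names are placeholders; the patent specifies no code):

```python
def iterate_and_keep_min_loss(model, image, labels, loss_fn, optimizer, N):
    """Run N forward/backward iterations (steps 1-4 above) and keep the
    smallest supervised loss, mirroring the minimum selection over the
    set {L_a^(1), ..., L_a^(N)} used by the smooth-z optimization."""
    losses = []
    for _ in range(N):
        loss = loss_fn(model(image), labels)  # step 1: forward pass and loss
        optimizer.zero_grad()
        loss.backward()                       # steps 2-3: errors and gradients, backward from the last layer
        optimizer.step()                      # step 4: gradient-descent update
        losses.append(loss.item())
    return min(losses)                        # L_a = min over the loss set
```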
Further, as shown in fig. 5, to implement step 150, an optimal bounding box screening method is proposed to retain the optimal bounding box and output the final detection result, which specifically includes:
first, a new bounding box list is regenerated for each target bounding box, and then the coordinates and confidence scores of the optimal bounding box are obtained through formula calculation. Wherein the confidence of the optimal bounding box is set to be the average confidence of all the boxes forming it, the coordinates of the optimal bounding box are a weighted sum of the coordinates of the boxes constituting it, wherein the weights are the confidence scores of the corresponding boxes, and the calculation formula is as follows
Wherein C is the confidence of the optimal bounding box, C i For the confidence of the ith detection frame in the list, A is the optimal selection coefficient, (x, y) is the coordinates of the fusion frame in the updated list, and N represents the number of boundary frames. Thus, a box with a higher confidence may contribute more to the fused box coordinates than a box with a lower confidence.
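A small numpy sketch of this fusion step for one cluster of overlapping boxes (function name assumed; the optimal-selection coefficient A is taken as 1 here):

```python
import numpy as np

def fuse_boxes(centers: np.ndarray, scores: np.ndarray):
    """Fuse one cluster of overlapping boxes (centers: N x 2 array) into the
    optimal box: coordinates are a confidence-weighted sum, confidence is
    the plain average, as described above (A assumed to be 1)."""
    weights = scores / scores.sum()                   # confidence weights
    coords = (centers * weights[:, None]).sum(axis=0) # weighted (x, y)
    return coords, float(scores.mean())               # fused coordinates and C
```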
In order to implement the method according to any one of the embodiments of the first aspect of the present application, an embodiment of the present application further provides an urban remote sensing image target detection device, as shown in fig. 6, including:
the acquisition module 610, which acquires an urban remote sensing image and preprocesses the image to obtain sub-images, as in step 110;
the detection module 600, which further comprises a mixed attention unit 620, a dual detection network unit 630 and an optimization unit 640; the mixed attention unit is configured to extract features from the image to obtain a feature map, as in step 120;
the dual detection network unit processes the feature map to obtain a predicted rotated bounding box, as in step 130;
the optimization unit obtains the deviation between the predicted rotated bounding box and the ground truth through the smooth-z loss function, and obtains a new bounding box by iteratively optimizing the loss value, as in step 140;
the selection module 650 is configured to retain the optimal bounding box and output the final detection result, as in step 150.
Further, the modules of the device of the present application are used to implement the further optimized embodiments of steps 110 to 150; see fig. 2 to fig. 6 and the related description, which are not repeated here.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should also be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or terminal device comprising that element.
Finally, it should be noted that the above describes preferred embodiments of the present application. Although preferred embodiments have been described, those skilled in the art, once aware of the basic inventive concepts, can make several modifications and adaptations without departing from the principles of the application, and such modifications and adaptations are intended to fall within its scope. The appended claims are therefore intended to be interpreted as covering the preferred embodiments and all alterations and modifications that fall within the scope of the embodiments of the application.

Claims (10)

1. An urban remote sensing image target detection method, characterized by comprising the following steps:
obtaining an urban remote sensing image, and preprocessing the image to obtain sub-images;
inputting the sub-images into a mixed attention backbone network for feature extraction to obtain a feature map;
constructing a dual detection network, and processing the feature map to obtain a predicted rotated bounding box;
obtaining the deviation between the predicted rotated bounding box and the ground truth using a smooth-z loss function, and obtaining a new bounding box by iteratively optimizing the loss value;
and retaining the optimal bounding box, and outputting the final detection result.
2. The urban remote sensing image target detection method according to claim 1, wherein the urban remote sensing image is a visible light image captured by a satellite-borne or airborne sensor;
the preprocessing cuts the original picture into several small pictures that are input into the mixed attention backbone network, and the sub-image prediction results are stitched and post-processed back into the full picture.
3. The method for detecting an urban remote sensing image target according to claim 2, wherein the feature extraction comprises:
urban remote sensing image I a Inputting the image data into a deep convolutional neural network model, extracting features of global and local information of a target in the image through mixed self-attention, and finally outputting an information integration feature map I b
4. The urban remote sensing image target detection method according to claim 1, wherein constructing a detection decoupling network in the dual detection network and predicting the category information and the position-angle information of the target separately through split classification and regression operations comprises:
performing classification on the input feature map I_b to obtain the category information C of the target, and performing a regression operation to obtain the position and angle information (x, y, w, h, θ) of the target, wherein x and y are respectively the abscissa and ordinate of the bounding box center, h and w are respectively the length and width of the bounding box, and θ is the rotation angle of the bounding box.
5. The urban remote sensing image target detection method according to claim 1, wherein constructing an angle correction network in the dual detection network and obtaining corrected angle information of the target through an angle regression operation, yielding a corrected predicted rotated bounding box, comprises:
the feature map I_b is likewise input into the angle correction network, and a regression operation yields the corrected angle information θ′; the L1 norm of the difference between θ and θ′ gives the deviation Δθ; if Δθ is larger than a preset threshold x, θ′ is assigned to θ to correct the rotation angle, otherwise θ is kept unchanged; the obtained position and angle information is fused, and the predicted rotated bounding box is finally output.
6. The urban remote sensing image target detection method according to claim 5, wherein obtaining the deviation between the predicted rotated bounding box and the ground truth using the smooth-z loss function and obtaining a new bounding box by iteratively optimizing the loss value comprises:
based on the obtained predicted rotated bounding box, an initial loss value between it and the ground truth is computed with the loss function; feature points of the urban remote sensing image I_a are then re-extracted, and the process is iterated a preset number of times to obtain the loss value set {L_a^(1), L_a^(2), …, L_a^(N)} of I_a under real-label supervision; the minimum element of this set is selected as the loss value L_a to update the position-angle information, wherein L_a^(n) denotes the loss value obtained at the n-th iteration and N denotes the preset number of iterations; the loss function finally converges, yielding a new bounding box.
7. The method of claim 1, wherein the retaining the optimal bounding box and outputting the final detection result comprises:
and regenerating a new bounding box list for all the obtained bounding boxes, and then sequencing the bounding boxes through formula calculation to obtain the coordinates and confidence scores of the optimal bounding boxes.
8. An urban remote sensing image target detection device for implementing the method of any one of claims 1 to 7, comprising:
the acquisition module is used for acquiring an urban remote sensing image, and preprocessing the image to obtain sub-images;
the detection module comprises a mixed attention unit, a dual detection network unit and an optimization unit;
the mixed attention unit is used for extracting the characteristics of the image and acquiring a characteristic diagram.
And the dual detection network unit obtains a prediction rotation boundary box by processing the feature map.
And the optimizing unit acquires the deviation between the predicted rotation bounding box and the true value through a smooth-z loss function, and acquires a new bounding box through iterative optimization of the loss value.
And the selection module is used for reserving an optimal boundary box and outputting a final detection result.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any one of claims 1-7.
10. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 7 when executing the computer program.
CN202310405274.8A 2023-04-17 2023-04-17 Urban remote sensing image target detection method and device Pending CN116612382A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310405274.8A CN116612382A (en) 2023-04-17 2023-04-17 Urban remote sensing image target detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310405274.8A CN116612382A (en) 2023-04-17 2023-04-17 Urban remote sensing image target detection method and device

Publications (1)

Publication Number Publication Date
CN116612382A (en) 2023-08-18

Family

ID=87684388

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310405274.8A Pending CN116612382A (en) 2023-04-17 2023-04-17 Urban remote sensing image target detection method and device

Country Status (1)

Country Link
CN (1) CN116612382A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117636078A (en) * 2024-01-25 2024-03-01 华南理工大学 Target detection method, target detection system, computer equipment and storage medium
CN117636078B (en) * 2024-01-25 2024-04-19 华南理工大学 Target detection method, target detection system, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination