CN116994164A - Multi-mode aerial image fusion and target detection combined learning method - Google Patents

Multi-mode aerial image fusion and target detection combined learning method

Info

Publication number
CN116994164A
CN116994164A (application CN202311058440.8A)
Authority
CN
China
Prior art keywords
image
fusion
target detection
feature
branch
Prior art date
Legal status
Pending
Application number
CN202311058440.8A
Other languages
Chinese (zh)
Inventor
于银辉
孙旭
余雨萍
方兆帆
刘雨晗
Current Assignee
Jilin University
Original Assignee
Jilin University
Priority date
Filing date
Publication date
Application filed by Jilin University
Priority to CN202311058440.8A
Publication of CN116994164A
Status: Pending

Classifications

    • G06V 20/17: Terrestrial scenes taken from planes or by drones
    • G06N 3/0455: Auto-encoder networks; encoder-decoder networks
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/048: Activation functions
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G06V 10/26: Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 10/764: Recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/806: Fusion of extracted features (combining data from various sources at the feature extraction level)
    • G06V 10/82: Recognition or understanding using pattern recognition or machine learning, using neural networks
    • Y02T 10/40: Engine management systems (climate change mitigation technologies related to transportation)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Remote Sensing (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a multi-mode aerial image fusion and target detection combined learning method, belonging to the technical field of remote sensing image processing and comprising the following steps: preliminarily training an image fusion branch with paired visible light and infrared images to generate a first fused image; designing a target detection branch with expert guidance information based on the first fused image; preliminarily training the target detection branch with the first fused image and outputting an expert feature map from the preliminarily trained branch; performing feature alignment processing on the expert feature map and fine-tuning the image fusion branch with the feature-aligned expert feature map to generate a second fused image; and fine-tuning the target detection branch with the second fused image to optimize the target detection task. The method simultaneously improves the performance of both visible-infrared aerial image fusion and target detection, and provides more accurate and efficient data analysis and decision support for unmanned aerial vehicle applications.

Description

Multi-mode aerial image fusion and target detection combined learning method
Technical Field
The invention belongs to the technical field of remote sensing image processing, and particularly relates to a multi-mode aerial image fusion and target detection combined learning method.
Background
In recent years, multi-mode aerial images captured by unmanned aerial vehicles (Unmanned Aerial Vehicles, UAVs) have received increasing attention and can be applied in many fields such as environmental investigation, urban planning, and disaster relief. The fusion of visible light and infrared aerial images and target detection are two important tasks in UAV applications.
Visible light images exhibit rich texture details by capturing the reflected light of the scene, but are often susceptible to light conditions. In contrast, infrared images have a strong anti-interference capability, can capture heat radiation information, and are suitable for various complex environments, but lack detailed texture information. The visible light and infrared image fusion task fully utilizes the complementary information between the two modes through fusion, and generates a fusion image containing more information, so that the performance of other high-level tasks is improved. The target detection task can quickly and accurately detect the target object by utilizing the generated high-quality fusion image and combining an image processing algorithm based on deep learning.
Most existing multi-mode image fusion and target detection methods improve the fusion effect by designing networks and introducing constraints, while neglecting the potential benefits offered by the target detection network. In 2023, Zhao Wenda et al. designed a meta-feature embedding model that enables features of the target detection network to guide the visible and infrared image fusion network. However, that method adopts an inner-and-outer two-stage update, which makes training complex, and it only considers simple natural images rather than complex aerial images. Unlike natural images, images captured by UAVs have a wide field of view, small and dense targets, and complicated background noise, which increases the difficulty of target detection.
The progress of large language foundation models such as GPT-4 has drawn great attention to foundation model development in the field of computer vision. Segmentation foundation models such as SAM (Segment Anything Model) and MobileSAM are novel interactive models designed specifically for image segmentation tasks and subsequent downstream applications, and they offer a new way to address the complex backgrounds, small target sizes, and poor fusion and detection performance encountered with multi-mode aerial images.
Therefore, how to use a segmentation foundation model to enhance the detection capability of the target detection branch for small targets in aerial images, and to improve the performance of both visible-infrared aerial image fusion and target detection so as to provide more accurate and efficient data analysis and decision support for UAV applications, is a problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the present invention provides a multi-mode aerial image fusion and target detection combined learning method, so as to at least solve some of the technical problems mentioned in the background art.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a multi-mode aerial image fusion and target detection combined learning method comprises the following steps of
S1, preliminarily training an image fusion branch with paired visible light images and infrared images to generate a first fused image;
S2, designing a target detection branch with expert guidance information based on the first fused image;
S3, preliminarily training the target detection branch with the first fused image, and outputting an expert feature map through the preliminarily trained target detection branch;
S4, performing feature alignment processing on the expert feature map, and fine-tuning the image fusion branch with the feature-aligned expert feature map to generate a second fused image;
S5, fine-tuning the target detection branch with the second fused image so as to optimize the target detection task.
Further, the step S2 specifically includes:
segmenting the first fused image into a plurality of image blocks using a segmentation foundation model;
and classifying and encoding the image blocks according to a preset area interval with a segmentation encoding module, and adaptively learning the encoded features with a mixture-of-experts gating mechanism to form a target detection branch with expert guidance information.
Further, the setting process of the preset area interval includes:
normalizing the target real frames in the paired visible light images and infrared images, calculating the areas of the normalized target real frames, and selecting one area as the first clustering center;
calculating the shortest Euclidean distance between the areas of the other target real frames and the first clustering center by adopting a K-means++ clustering algorithm, the areas of the other target real frames being the areas of the target real frames other than the first clustering center;
calculating, according to the shortest Euclidean distance, the probability that the area of each target real frame is selected as the next clustering center, until K clusters are obtained, yielding K area intervals; and taking the K area intervals as the preset area interval.
Further, the segmentation encoding module specifically includes:
respectively calculating the minimum circumscribed rectangular areas of the image blocks;
dividing the minimum circumscribed rectangular areas into K classes according to the preset area interval, setting the target area of each class to 1 and the other areas to 0, and obtaining a K-channel Mask matrix;
flattening the Mask matrix and mapping it to a fixed dimension; based on the patch embedding and position embedding features, performing a self-attention operation with a Transformer encoder to obtain a feature map;
and carrying out downsampling treatment on the feature map by adopting a convolution module to obtain the feature map after downsampling treatment.
Further, the feature map size after the downsampling process is the same as the feature map size of the target detection branch output.
Further, adaptively learning the encoded features with the mixture-of-experts gating mechanism specifically comprises:
firstly, concatenating the feature map output by the target detection branch and the downsampled feature map along the channel dimension; then generating a weight from the concatenated feature map through a gating network; and finally, performing a weighted linear combination of the two feature maps with the corresponding weights to generate an expert feature map.
Further, the step S4 specifically includes:
performing feature alignment processing on the expert feature map; constructing a first loss function based on the expert feature map after feature alignment processing;
linearly combining the first loss function and the second loss function of the image fusion branch to obtain an image fusion loss function;
optimizing the image fusion loss function through back propagation until an optimal image fusion branch is obtained; and generating a second fusion image based on the optimal image fusion branch.
Compared with the prior art, the invention discloses a multi-mode aerial image fusion and target detection combined learning method, which has the following beneficial effects:
(1) The joint learning method adopted by the invention improves not only the fusion quality and visual effect of visible light and infrared images, but also the target detection performance on aerial images, thereby achieving collaborative optimization of the two tasks.
(2) The invention focuses on the target detection task for multi-mode aerial images, fully exploits the ability of the segmentation foundation model to segment anything, builds a Mask matrix via a clustering algorithm to provide expert guidance information for the target detection branch, and enhances the detector's ability to detect small targets in the complex working environment of UAVs.
(3) The invention performs knowledge modeling on the guidance information provided by the segmentation foundation model with a mixture-of-experts gating mechanism, which dynamically weights the reliance on each expert's information, effectively suppresses the complicated background noise in aerial images, and lets the detector focus more on reliable target objects.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a multi-mode aerial image fusion and target detection combined learning method provided by an embodiment of the invention.
Fig. 2 is a schematic diagram of a multi-mode aerial image fusion and target detection combined learning method framework provided by the embodiment of the invention.
Fig. 3 is a schematic diagram of SAM guiding target detection branches according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of a SAM segmented image block and a corresponding 3-channel visual Mask matrix according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of an expert feature map of an object detection branch according to an embodiment of the present invention for fine-tuning an image fusion branch through feature alignment.
Fig. 6 is a schematic diagram of a fused image generated by combining optimized image fusion branches according to an embodiment of the present invention.
Fig. 7 is a schematic diagram of a detection effect of a target detection branch after joint optimization according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, the embodiment of the invention discloses a multi-mode aerial image fusion and target detection combined learning method, which is implemented based on the segmentation foundation model SAM; the implementation flow comprises the following steps:
s1, adopting paired visible light images and infrared images to preliminarily train image fusion branches to generate a first fusion image;
s2, designing a target detection branch with expert guiding information based on the first fusion image;
s3, performing preliminary training on the target detection branch by adopting a first fusion image; outputting an expert feature map through the target detection branch after preliminary training;
s4, performing feature alignment processing on the expert feature map; fine tuning the image fusion branch by adopting the expert feature map after feature alignment processing to generate a second fusion image;
s5, fine tuning is carried out on the target detection branch by adopting the second fusion image, so that the target detection task is optimized.
The respective steps described above are explained next.
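Before the per-step details, the five steps can be organized into a training schedule of the following form. This is only a minimal sketch: the two branches are assumed to return (output, features) tuples, the loss callables (fusion_loss, detection_loss, guidance_loss) are assumed to be supplied by the caller, and all names, epoch counts, and optimizer settings are illustrative rather than values prescribed by the patent.

```python
import torch

def joint_learning_schedule(fusion_branch, detection_branch, loader,
                            fusion_loss, detection_loss, guidance_loss,
                            epochs=10, lam=0.1, lr=1e-4):
    """Hedged sketch of the S1-S5 schedule; interfaces and hyperparameters are assumptions."""
    opt_f = torch.optim.Adam(fusion_branch.parameters(), lr=lr)
    opt_d = torch.optim.Adam(detection_branch.parameters(), lr=lr)

    # S1: preliminary training of the image fusion branch on paired visible/infrared images.
    for _ in range(epochs):
        for vis, ir, targets in loader:
            fused, _ = fusion_branch(vis, ir)                 # first fused image
            loss = fusion_loss(fused, vis, ir)                # SSIM-based L_f (eq. 13)
            opt_f.zero_grad(); loss.backward(); opt_f.step()

    # S2/S3: preliminary training of the SAM-guided detection branch on the first fused image.
    for _ in range(epochs):
        for vis, ir, targets in loader:
            with torch.no_grad():
                fused, _ = fusion_branch(vis, ir)
            preds, expert_feats = detection_branch(fused)     # expert feature maps F_mi
            loss = detection_loss(preds, targets)
            opt_d.zero_grad(); loss.backward(); opt_d.step()

    # S4: fine-tune the fusion branch with aligned expert features (L = L_f + lambda * L_g, eq. 12).
    for _ in range(epochs):
        for vis, ir, targets in loader:
            fused, fusion_feats = fusion_branch(vis, ir)
            with torch.no_grad():
                _, expert_feats = detection_branch(fused)
            loss = fusion_loss(fused, vis, ir) + lam * guidance_loss(expert_feats, fusion_feats)
            opt_f.zero_grad(); loss.backward(); opt_f.step()

    # S5: fine-tune the detection branch on the second (higher-quality) fused image.
    for _ in range(epochs):
        for vis, ir, targets in loader:
            with torch.no_grad():
                fused, _ = fusion_branch(vis, ir)             # second fused image
            preds, _ = detection_branch(fused)
            loss = detection_loss(preds, targets)
            opt_d.zero_grad(); loss.backward(); opt_d.step()
```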
In the above step S1, referring to fig. 2, the image fusion branch includes a plurality of feature fusion modules and an image reconstruction module. In the training process, the paired visible light and infrared images are first input into the feature fusion modules, which sequentially extract the image features of the visible light image and the infrared image and fuse them to obtain image fusion features; the image reconstruction module then constructs a fused image, namely the first fused image, from the obtained image fusion features.
in the embodiment of the invention, the image fusion branch comprises 3 feature fusion modules and 1 image reconstruction module; each feature fusion module consists of 23×3 convolutions and a Relu activation function, and features are fused among the modules in a splicing manner; the image reconstruction module consists of 6 3 x 3 convolutions and a Relu activation function.
In the above step S2, referring specifically to fig. 3, the first fused image generated in step S1 is segmented into a plurality of image blocks using the segmentation foundation model; the image blocks are then classified and encoded according to the preset area interval by the segmentation encoding module, and the encoded features are adaptively learned with a mixture-of-experts gating mechanism to form a target detection branch with expert guidance information.
the setting process of the preset area interval comprises the following steps:
(1) Respectively normalizing the target real frames in the paired visible light images and infrared images, expressed by the formula:
w' = w / W,  h' = h / H  (1)
wherein w represents the width of the target real frame; h represents the height of the target real frame; W represents the width of the visible light image or the infrared image; H represents the height of the visible light image or the infrared image;
then calculating the area of the target real frame after normalization processing, and selecting one area as a first clustering center;
(2) Calculating the shortest Euclidean distance d(s, c) between the areas of the other target real frames and the first clustering center by adopting a K-means++ clustering algorithm; the areas of the other target real frames are the areas of the target real frames except the first clustering center; the shortest Euclidean distance d(s, c) is expressed as:
d(s, c) = sqrt( Σ_{n=1..N} (s_n - c_n)² )  (2)
wherein s_n and c_n represent the n-th dimension of the area vectors of the target real frame and the clustering center, respectively; N represents the dimension of the vectors;
(3) Calculating the probability P(s) that the area of each target real frame is selected as the next clustering center according to the shortest Euclidean distance until K clusters are clustered out, and obtaining K sections of area intervals; taking the K section area interval as a preset area interval;
the probability P(s) of being selected as the next cluster center is expressed as:
where c' represents other cluster centers.
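A minimal NumPy sketch of this K-means++ seeding over normalized box areas is given below; it treats each box's normalized area as a one-dimensional sample, derives the K area intervals from the midpoints between sorted centers, and uses illustrative values throughout (K, the image size, and the boxes are assumptions, and the subsequent K-means refinement iterations are omitted).

```python
import numpy as np

def kmeanspp_area_intervals(boxes_wh, image_wh, K=3, seed=0):
    """K-means++ seeding (eqs. 1-3) over normalized ground-truth box areas,
    returning K area intervals delimited by midpoints between the sorted centers."""
    rng = np.random.default_rng(seed)
    W, H = image_wh
    areas = (boxes_wh[:, 0] / W) * (boxes_wh[:, 1] / H)       # eq. (1): normalized box areas
    centers = [rng.choice(areas)]                             # first clustering center picked at random
    for _ in range(K - 1):
        d2 = np.min((areas[:, None] - np.asarray(centers)[None, :]) ** 2, axis=1)  # eq. (2), squared
        centers.append(rng.choice(areas, p=d2 / d2.sum()))    # eq. (3): selection probability
    centers = np.sort(np.asarray(centers))
    edges = np.concatenate(([0.0], (centers[:-1] + centers[1:]) / 2.0, [1.0]))
    return list(zip(edges[:-1], edges[1:]))                   # K area intervals

# Illustrative usage with made-up boxes in a 640x512 image
boxes = np.array([[12, 10], [30, 25], [90, 60], [15, 14], [100, 80]], dtype=float)
print(kmeanspp_area_intervals(boxes, (640, 512), K=3))
```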
The above-mentioned segmentation encoding module specifically includes:
(1) Respectively calculating the minimum circumscribed rectangular areas of the image blocks; dividing the minimum circumscribed rectangular areas into K classes according to the preset area interval, setting the target area of each class to 1 and the other areas to 0, and obtaining a K-channel Mask matrix; FIG. 4 shows SAM-segmented image blocks and the corresponding 3-channel visualization of the Mask matrix;
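The Mask matrix construction in item (1) could be sketched as follows; the segment masks are assumed to be boolean H×W arrays (as SAM typically returns), and the area intervals are assumed to come from the clustering step above.

```python
import numpy as np

def build_mask_matrix(segment_masks, area_intervals, image_wh):
    """Build a K-channel Mask matrix: each segment is assigned to one of K classes by the
    normalized area of its minimum circumscribed rectangle, whose region is set to 1."""
    W, H = image_wh
    K = len(area_intervals)
    mask_matrix = np.zeros((K, H, W), dtype=np.float32)
    for seg in segment_masks:                                  # seg: boolean array of shape (H, W)
        ys, xs = np.nonzero(seg)
        if xs.size == 0:
            continue
        x0, x1, y0, y1 = xs.min(), xs.max(), ys.min(), ys.max()
        area = ((x1 - x0 + 1) / W) * ((y1 - y0 + 1) / H)       # normalized bounding-rectangle area
        for k, (lo, hi) in enumerate(area_intervals):
            if lo <= area <= hi:
                mask_matrix[k, y0:y1 + 1, x0:x1 + 1] = 1.0     # target region 1, elsewhere 0
                break
    return mask_matrix
```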
(2) Flattening the Mask matrix and mapping the flattened Mask matrix into a fixed dimension; based on the patch embedding and position embedding features, performing a self-attention operation with a Transformer encoder to obtain a feature map; expressed as:
Q = LN(q·W)  (4)
K = LN(Reshape(k, R_1)·W)  (5)
V = LN(Reshape(v, R_2)·W)  (6)
Attention(Q, K, V) = Softmax(Q·K^T / sqrt(d))·V  (7)
wherein Q represents the query; K represents the key; V represents the value; LN represents layer normalization; R_1 and R_2 represent reduction ratios; W represents a linear mapping; d represents a scale factor.
(3) Performing downsampling processing on the feature map by adopting a convolution module to obtain a feature map after downsampling processing; the feature map size after the downsampling process is the same as the feature map size of the target detection branch output.
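A compact PyTorch sketch of this encoding step (flattened Mask tokens attended with the reduction-ratio attention of eqs. (4)-(7), followed by a convolutional downsampling) is shown below; the embedding dimension, the single reduction ratio, the single attention head, and the downsampling stride are assumptions, and the patch/position embeddings are taken as already folded into the input tokens.

```python
import torch
import torch.nn as nn

class ReductionSelfAttention(nn.Module):
    """Self-attention whose keys and values are spatially reduced before projection (eqs. 4-7)."""
    def __init__(self, dim=256, reduction=4):
        super().__init__()
        self.q_proj, self.k_proj, self.v_proj = (nn.Linear(dim, dim) for _ in range(3))
        self.reduce = nn.Conv2d(dim, dim, kernel_size=reduction, stride=reduction)  # Reshape(., R)
        self.norm = nn.LayerNorm(dim)
        self.scale = dim ** -0.5                                   # 1 / sqrt(d)

    def forward(self, tokens, hw):
        b, n, c = tokens.shape
        h, w = hw
        q = self.norm(self.q_proj(tokens))                                        # eq. (4)
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        x = self.reduce(x).flatten(2).transpose(1, 2)                             # spatially reduced tokens
        k = self.norm(self.k_proj(x))                                             # eq. (5)
        v = self.norm(self.v_proj(x))                                             # eq. (6)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)          # eq. (7)
        return attn @ v

# A strided convolution then downsamples the encoder output to the size of the
# detection branch's feature map (illustrative channel count and stride).
downsample = nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=1)
```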
The method adaptively learns the encoded features with the mixture-of-experts gating mechanism to provide expert guidance information for the target detection branch, which specifically comprises the following steps:
firstly, the feature map output by the target detection branch and the downsampled feature map are concatenated along the channel dimension; then a weight is generated from the concatenated feature map by the gating unit; finally, the two feature maps and the corresponding weights are combined in a weighted linear combination to generate an expert feature map; the whole process is expressed as:
f_i = concatenate(F_di, F_si)  (8)
ω_i = g(flatten(f_i))  (9)
F_mi = ω_i·F_di + (1 - ω_i)·F_si  (10)
wherein i = 1, 2, 3, 4; F_di represents the feature map output by the target detection branch; F_si represents the downsampled feature map; g represents the gating unit function; f_i represents the concatenated feature map; ω_i represents the weight; and F_mi represents the expert feature map.
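Under the naming of eqs. (8)-(10), the gating step could be sketched as below. The gating network g(·) is assumed here to be a single linear layer with a sigmoid over the flattened concatenation, producing one weight per sample; the patent does not pin down its exact architecture, so all shapes are illustrative.

```python
import torch
import torch.nn as nn

class ExpertGate(nn.Module):
    """Weighted combination of the detector feature map F_d and the SAM-guided map F_s (eqs. 8-10)."""
    def __init__(self, channels, height, width):
        super().__init__()
        # g(.): flatten the concatenated maps and predict a scalar weight in (0, 1) per sample.
        self.gate = nn.Sequential(nn.Linear(2 * channels * height * width, 1), nn.Sigmoid())

    def forward(self, f_d, f_s):
        f = torch.cat([f_d, f_s], dim=1)                       # eq. (8): channel-wise concatenation
        w = self.gate(f.flatten(1)).view(-1, 1, 1, 1)          # eq. (9): gating weight omega_i
        return w * f_d + (1.0 - w) * f_s                       # eq. (10): expert feature map F_mi

# Illustrative usage on one pyramid level (all shapes are assumptions)
gate = ExpertGate(channels=256, height=32, width=32)
f_m = gate(torch.randn(2, 256, 32, 32), torch.randn(2, 256, 32, 32))
```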
In the above step S3, referring to fig. 2, the target detection branch includes a backbone network (ResNet50), a feature pyramid network, a mixture-of-experts gate, and a detection head. In the training process, the first fused image is input into the backbone network, which extracts image features of the first fused image at different scales; the feature pyramid network fuses the features of different scales to obtain fused feature maps with high resolution and strong semantics; the fused feature maps are input into the mixture-of-experts gating network, and target classification and regression are realized through the detection head by combining the expert guidance information.
the target detection branch also comprises a SAM (Segment Anything Model, dividing all models) and a dividing encoding module; inputting the first fusion image into a SAM, and dividing the first fusion image into a plurality of image blocks; the segmentation coding module carries out coding processing on the segmented image blocks, so that the semantic effect of the feature map is further enhanced; and then, the feature images output by the segmentation coding module are input into a mixed special gating network through downsampling, and expert guiding information is provided for the target detection branch.
In the above step S4, because of the difference between the image fusion task and the target detection task, their features are incompatible, and the features output by the target detection branch cannot be used directly to assist the image fusion branch; a feature alignment module is therefore adopted to match the features of the two tasks. In the embodiment of the invention, the expert feature map undergoes feature alignment processing by the feature alignment module, and a first loss function L_g is constructed based on the feature-aligned expert feature map to fine-tune the image fusion branch so that it generates more target semantic information, see in particular fig. 5.
the step S4 specifically comprises the following steps:
constructing a first loss function L based on an expert feature map of target detection branch output after preliminary training g The method comprises the steps of carrying out a first treatment on the surface of the First loss function L g Is a SmoothL1 loss function; expressed as:
wherein F is ui Representing the characteristics output by the characteristic alignment module; f (F) vi Features representing image fusion branches;
acquiring a second loss function L of an image fusion branch f
Linearly combining the first loss function and the second loss function to obtain an image fusion loss function L;
continuously optimizing the image fusion loss function L through back propagation until an optimal image fusion branch is obtained; expressed as:
L=L f +λL g (12)
L f =θ 1 ·(1-SSIM(I f ,I 1 ))+θ 2 ·(1-SSIM(I f ,I 2 )) (13)
wherein lambda, theta 1 And theta 2 All represent weight ratios; SSIM represents a structural similarity penalty; i 1 Representing a visible light image; i 2 Representing an infrared image; i f Representing the first fused image or the second fused image; mu (mu) 1 An average pixel representing a visible light image; mu (mu) 2 Average pixels representing the infrared image; mu (mu) f Average pixels representing the first fused image or the second fused image; sigma (sigma) 1 Representing the standard deviation of pixels of the visible light image; sigma (sigma) 2 Representing the pixel standard deviation of the infrared image; sigma (sigma) f Representing a standard deviation of pixels of the first fused image or the second fused image; sigma (sigma) f1 Representing pixel covariance of the fused image and the visible image; sigma (sigma) f2 Is the pixel covariance of the fused image and the infrared image; the fusion image is a first fusion image or a second fusion image; k (k) 1 And k 2 All represent constants.
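The composite loss could be written out as below, computing SSIM from the global image statistics defined above (a simplification; a windowed SSIM would also fit) together with the SmoothL1 guidance term of eq. (11); the weights λ, θ1, θ2 and the constants are illustrative defaults rather than values given in the patent, and the fused image and the inputs are assumed to have matching shapes.

```python
import torch
import torch.nn.functional as F

def ssim_global(x, y, k1=0.01 ** 2, k2=0.03 ** 2):
    """SSIM from global image statistics, following the mu/sigma variables defined above (eq. 14)."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()                 # assumes x and y have the same shape
    return ((2 * mu_x * mu_y + k1) * (2 * cov + k2)) / ((mu_x ** 2 + mu_y ** 2 + k1) * (var_x + var_y + k2))

def fusion_loss(fused, vis, ir, theta1=0.5, theta2=0.5):
    """L_f of eq. (13): weighted SSIM losses against the visible and infrared inputs."""
    return theta1 * (1 - ssim_global(fused, vis)) + theta2 * (1 - ssim_global(fused, ir))

def guidance_loss(expert_feats, fusion_feats):
    """L_g of eq. (11): SmoothL1 between aligned expert features F_ui and fusion features F_vi."""
    return sum(F.smooth_l1_loss(fu, fv) for fu, fv in zip(expert_feats, fusion_feats))

def total_fusion_loss(fused, vis, ir, expert_feats, fusion_feats, lam=0.1):
    """L of eq. (12): L_f + lambda * L_g."""
    return fusion_loss(fused, vis, ir) + lam * guidance_loss(expert_feats, fusion_feats)
```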
After the image fusion branch is fine-tuned, a second fused image is generated; since the image fusion branch has been optimized, the quality of the generated second fused image is higher, see in particular fig. 6.
In the above step S5, the second fused image is adopted to further fine-tune the target detection branch so as to optimize the target detection task; the detection effect is shown in fig. 7, from which it can be seen that the fine-tuned target detection branch can accurately distinguish easily confused target categories and can still identify small targets even under insufficient light.
In summary, the embodiment of the invention provides a multi-mode aerial image fusion and target detection combined learning method, which utilizes expert guidance information provided by a segmentation foundation model to enhance the detection capability of the target detection branch for small targets in aerial images, and uses the guided target detection branch to assist the image fusion branch in generating more target semantic information. The method simultaneously improves the performance of both visible-infrared aerial image fusion and target detection, and provides more accurate and efficient data analysis and decision support for UAV applications.
In the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, and identical and similar parts between the embodiments are all enough to refer to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (7)

1. A multi-mode aerial image fusion and target detection combined learning method is characterized by comprising the following steps:
S1, preliminarily training an image fusion branch with paired visible light images and infrared images to generate a first fused image;
S2, designing a target detection branch with expert guidance information based on the first fused image;
S3, preliminarily training the target detection branch with the first fused image, and outputting an expert feature map through the preliminarily trained target detection branch;
S4, performing feature alignment processing on the expert feature map, and fine-tuning the image fusion branch with the feature-aligned expert feature map to generate a second fused image;
S5, fine-tuning the target detection branch with the second fused image so as to optimize the target detection task.
2. The multi-mode aerial image fusion and target detection joint learning method according to claim 1, wherein the step S2 specifically includes:
segmenting the first fused image into a plurality of image blocks using a segmentation foundation model;
and classifying and encoding the image blocks according to a preset area interval with a segmentation encoding module, and adaptively learning the encoded features with a mixture-of-experts gating mechanism to form a target detection branch with expert guidance information.
3. The multi-mode aerial image fusion and target detection joint learning method according to claim 2, wherein the setting process of the preset area interval comprises:
normalizing the target real frames in the paired visible light images and infrared images, calculating the areas of the normalized target real frames, and selecting one area as the first clustering center;
calculating the shortest Euclidean distance between the areas of the other target real frames and the first clustering center by adopting a K-means++ clustering algorithm, the areas of the other target real frames being the areas of the target real frames other than the first clustering center;
calculating, according to the shortest Euclidean distance, the probability that the area of each target real frame is selected as the next clustering center, until K clusters are obtained, yielding K area intervals; and taking the K area intervals as the preset area interval.
4. The multi-modal aerial image fusion and target detection joint learning method according to claim 2, wherein the segmentation encoding module specifically comprises:
respectively calculating the minimum circumscribed rectangular areas of the image blocks;
dividing the minimum circumscribed rectangular areas into K classes according to the preset area interval, setting the target area of each class to 1 and the other areas to 0, and obtaining a K-channel Mask matrix;
flattening the Mask matrix and mapping it to a fixed dimension; based on the patch embedding and position embedding features, performing a self-attention operation with a Transformer encoder to obtain a feature map;
and carrying out downsampling treatment on the feature map by adopting a convolution module to obtain the feature map after downsampling treatment.
5. The multi-modal aerial image fusion and target detection joint learning method of claim 4, wherein the feature map size after the downsampling process is the same as the feature map size of the target detection branch output.
6. The multi-modal aerial image fusion and target detection joint learning method according to claim 4, wherein adaptively learning the encoded features with the mixture-of-experts gating mechanism specifically comprises:
firstly, concatenating the feature map output by the target detection branch and the downsampled feature map along the channel dimension; then generating a weight from the concatenated feature map through a gating network; and finally, performing a weighted linear combination of the two feature maps with the corresponding weights to generate an expert feature map.
7. The multi-mode aerial image fusion and target detection joint learning method according to claim 1, wherein the step S4 specifically includes:
performing feature alignment processing on the expert feature map; constructing a first loss function based on the expert feature map after feature alignment processing;
linearly combining the first loss function and the second loss function of the image fusion branch to obtain an image fusion loss function;
optimizing the image fusion loss function through back propagation until an optimal image fusion branch is obtained; and generating a second fusion image based on the optimal image fusion branch.
CN202311058440.8A 2023-08-22 2023-08-22 Multi-mode aerial image fusion and target detection combined learning method Pending CN116994164A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311058440.8A CN116994164A (en) 2023-08-22 2023-08-22 Multi-mode aerial image fusion and target detection combined learning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311058440.8A CN116994164A (en) 2023-08-22 2023-08-22 Multi-mode aerial image fusion and target detection combined learning method

Publications (1)

Publication Number Publication Date
CN116994164A true CN116994164A (en) 2023-11-03

Family

ID=88531998

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311058440.8A Pending CN116994164A (en) 2023-08-22 2023-08-22 Multi-mode aerial image fusion and target detection combined learning method

Country Status (1)

Country Link
CN (1) CN116994164A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117274826A (en) * 2023-11-23 2023-12-22 山东锋士信息技术有限公司 River and lake management violation problem remote sensing monitoring method based on large model and prompt guidance
CN117274826B (en) * 2023-11-23 2024-03-08 山东锋士信息技术有限公司 River and lake management violation problem remote sensing monitoring method based on large model and prompt guidance


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination