CN116958782A - Method and device for detecting weak and small targets by combining infrared and visible light characteristics - Google Patents

Method and device for detecting weak and small targets by combining infrared and visible light characteristics

Info

Publication number
CN116958782A
Authority
CN
China
Prior art keywords
image
visible light
features
infrared
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310813236.6A
Other languages
Chinese (zh)
Inventor
李宏菲
臧义华
杨晓梅
李奕洁
杨馨雨
马兴民
李雪扬
陈莉莉
凡叔军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 15 Research Institute
Original Assignee
CETC 15 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 15 Research Institute filed Critical CETC 15 Research Institute
Priority to CN202310813236.6A priority Critical patent/CN116958782A/en
Publication of CN116958782A publication Critical patent/CN116958782A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/52Scale-space analysis, e.g. wavelet analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10048Infrared image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a method and a device for detecting weak and small targets by fusing infrared and visible light features. The method comprises the following steps: respectively preprocessing an infrared image and a visible light image, wherein the infrared image and the visible light image comprise at least one detection target; respectively extracting image features of the infrared image and the visible light image by using a configured double-flow backbone network and performing feature fusion to obtain multi-scale image features; and based on the multi-scale image features, acquiring the position information and the classification result of the detection target by using a configured positioning classification model. According to the method provided by the embodiment, feature extraction is performed on the two-mode data of infrared and visible light through the double-flow backbone network model, complementation of the two-mode features is achieved through cross-modal multi-level inter-stage feature fusion, and positioning and classification of weak and small targets are achieved through a deep-learning-based network, thereby improving the detection accuracy for weak and small targets.

Description

Method and device for detecting weak and small targets by combining infrared and visible light characteristics
Technical Field
The application relates to the technical field of infrared and visible light image detection, in particular to a method and a device for detecting a weak and small target by fusing infrared and visible light characteristics.
Background
Feature fusion of infrared and visible light images means that, after the features of the infrared and visible light images are extracted by a deep-learning-based network, features at the same shallow or deep level are fused by a fusion module. By utilizing the complementarity between the features and the advantages of the fused features, more position information and semantic information are obtained, which further improves the accuracy with which the model locates and identifies small targets.
The detection and classification of the infrared weak and small targets refers to effectively distinguishing background areas (such as trees, sea waves, buildings, sky and the like) from weak and small target areas (such as unmanned aerial vehicles and the like) in an infrared image obtained through shooting, and marking the target areas and completing classification of the weak and small targets.
The traditional infrared dim and small target detection algorithm mainly realizes detection by suppressing the background and enhancing the target. When the background of the infrared image is complex or the signal-to-noise ratio of the target is low, such algorithms easily generate many false alarms and the detection accuracy is low. Many deep-learning-based methods directly apply generic object detection or semantic segmentation networks, and their detection effect is not ideal.
Disclosure of Invention
The technical problem addressed by the application is the poor detection precision caused by factors such as the insignificant characteristics of infrared dim and small targets when detecting them under a complex background; in view of the above, the application provides a method and a device for detecting weak and small targets by combining infrared and visible light characteristics.
The technical scheme of the application is a method for detecting weak and small targets by fusing infrared and visible light features, comprising the following steps:
step S1, respectively preprocessing an infrared image and a visible light image, wherein the infrared image and the visible light image comprise at least one detection target;
step S2, respectively extracting image features of the infrared image and the visible light image by using the configured double-flow backbone network and carrying out feature fusion to obtain multi-scale image features;
and step S3, based on the multi-scale image characteristics, acquiring the position information and the classification result of the detection target by using a configured positioning classification model.
In one embodiment, the step S1 includes:
acquiring the infrared image and the visible light image;
performing a differential operation between the infrared image and its low-pass-filtered blurred image based on a configured unsharp masking edge and detail enhancement algorithm to obtain the high-frequency components representing the edges and details of the infrared image, multiplying the high-frequency components by a configured gain coefficient, and then superposing the result on the original image to obtain an infrared image with enhanced target edges and details;
and processing the visible light image by using histogram equalization, augmenting the processed visible light image to increase the number of samples, and scaling the current visible light image based on a configured bilinear interpolation algorithm.
In one embodiment, the step S2 includes:
extracting features of the infrared image and the visible light image by using a configured ResNet network, wherein the extracted features comprise shallow features and high-level features;
the spatial information of the target area corresponding to the image is enhanced by using the configured spatial information guiding module, and shallow layer characteristics of the two images are guided to be fused;
adopting at least one of an SE module or a coordinate attention module to strengthen the semantic information of the high-level features, adopting dilated convolutions with different dilation rates to capture the multi-scale context information of the two-mode high-level features, and splicing the multi-scale context information with the strengthened high-level features to obtain the enhanced high-level features of the two modes;
and fusing the current shallow layer characteristics and the high layer characteristics to obtain multi-scale image characteristics.
In one embodiment, the step S3 includes:
acquiring the high-level image characteristics;
and acquiring the position information and the classification result of the detection target by using a configured positioning classification model, wherein the positioning classification model is determined based on two parts of contents of the RPN and the head structure.
In another aspect, the present application provides a weak and small target detection device with integrated infrared and visible light features, including:
the preprocessing unit is configured to respectively preprocess an infrared image and a visible light image, wherein the infrared image and the visible light image comprise at least one detection target;
the fusion unit is configured to extract image features of the infrared image and the visible light image respectively by using the configured double-flow backbone network and perform feature fusion so as to obtain multi-scale image features;
and the detection unit is configured to acquire the position information and the classification result of the detection target by using a configuration positioning classification model based on the multi-scale image characteristics.
In one embodiment, the preprocessing unit is further configured to:
acquiring the infrared image and the visible light image;
performing a differential operation between the infrared image and its low-pass-filtered blurred image based on a configured unsharp masking edge and detail enhancement algorithm to obtain the high-frequency components representing the edges and details of the infrared image, multiplying the high-frequency components by a configured gain coefficient, and then superposing the result on the original image to obtain an infrared image with enhanced target edges and details;
and processing the visible light image by using histogram equalization, augmenting the processed visible light image to increase the number of samples, and scaling the current visible light image based on a configured bilinear interpolation algorithm.
In one embodiment, the fusion unit is further configured to:
extracting features of the infrared image and the visible light image by using a configured ResNet network, wherein the extracted features comprise shallow features and high-level features;
the spatial information of the target area corresponding to the image is enhanced by using the configured spatial information guiding module, and shallow layer characteristics of the two images are guided to be fused;
adopting at least one of an SE module or a coordinate attention module to strengthen the semantic information of the high-level features, adopting dilated convolutions with different dilation rates to capture the multi-scale context information of the two-mode high-level features, and splicing the multi-scale context information with the strengthened high-level features to obtain the enhanced high-level features of the two modes;
and fusing the current shallow layer characteristics and the high layer characteristics to obtain multi-scale image characteristics.
In one embodiment, the detection unit is further configured to:
acquiring the high-level image characteristics;
and acquiring the position information and the classification result of the detection target by using a configured positioning classification model, wherein the positioning classification model is determined based on two parts of contents of the RPN and the head structure.
Another aspect of the present application also provides an electronic device including: a memory, a processor, and a computer program stored on the memory and executable on the processor, which when executed by the processor, performs the steps of the method for detecting a small target with fusion of infrared and visible light features as defined in any one of the above.
Another aspect of the present application also provides a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method for detecting a dim target with fusion of infrared and visible light features as described in any one of the above.
By adopting the technical scheme, the application has at least the following advantages:
according to the method provided by the embodiment of the application, feature extraction is realized on the infrared and visible light two-mode data through the double-flow backbone network model, the complementation of the two-mode features is realized through the cross-modal multi-level inter-stage feature fusion, the positioning and classification of the weak and small targets are realized by using a deep-learning-based network, and the detection precision for weak and small targets is thereby improved.
Drawings
FIG. 1 is a schematic flow chart of a method for detecting a weak target by fusing infrared and visible light features according to an embodiment of the application;
FIG. 2 is a block diagram of a dual-stream cross-modality inter-layer fusion feature extraction backbone network according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a multi-receptive field feature enhancement module in accordance with embodiments of the application;
FIG. 4 is a schematic diagram of a hybrid attention module according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a cross-modal fusion module guided by shallow spatial information according to an embodiment of the application;
FIG. 6 is a schematic diagram of a coordinate attention module architecture according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a high level semantic information guided fusion (CFF) module architecture according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a multi-scale feature fusion network architecture with semantic and spatial information enhancement according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a semantic enhancement module (CEM) according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a Spatial Enhancement Module (SEM) according to an embodiment of the present application;
FIG. 11 is a schematic diagram of an infrared dim target positioning and classifying model according to an embodiment of the present application;
FIG. 12 is a flow chart of infrared dim target detection and classification according to an embodiment of the present application;
FIG. 13 is a block diagram of a small and weak object detection device with integrated infrared and visible light features according to an embodiment of the present application;
fig. 14 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to further describe the technical means and effects adopted by the present application for achieving the intended purpose, the following detailed description of the present application is given with reference to the accompanying drawings and preferred embodiments.
In the drawings, the thickness, size and shape of the object have been slightly exaggerated for convenience of explanation. The figures are merely examples and are not drawn to scale.
It will be further understood that the terms "comprises," "comprising," "includes," "including," "having," and/or "containing," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Furthermore, when a statement such as "at least one of" follows a list of features, it modifies the entire list rather than an individual element in the list. Furthermore, when describing embodiments of the present application, the use of "may" means "one or more embodiments of the present application may". Also, the term "exemplary" is intended to refer to an example or illustration.
As used herein, the terms "substantially," "about," and the like are used as terms of approximation and not as terms of degree, and are intended to account for the inherent deviations in measured or calculated values that would be recognized by one of ordinary skill in the art.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
In a first embodiment of the present application, a method for detecting a small target by combining infrared and visible light features, as shown in fig. 1, includes the following steps:
step S1, respectively preprocessing an infrared image and a visible light image, wherein the infrared image and the visible light image comprise at least one detection target;
step S2, respectively extracting image features of the infrared image and the visible light image by using the configured double-flow backbone network and carrying out feature fusion to obtain multi-scale image features;
and step S3, based on the multi-scale image characteristics, acquiring the position information and the classification result of the detection target by using a configured positioning classification model.
The method provided by the application will be described in detail in steps.
Step S1, respectively preprocessing an infrared image and a visible light image, wherein the infrared image and the visible light image comprise at least one detection target.
In this embodiment, step S1 further includes:
S101, acquiring the infrared image and the visible light image;
S102, based on a configured unsharp masking edge and detail enhancement algorithm, performing a differential operation between the infrared image and its low-pass-filtered blurred image to obtain the high-frequency components representing the edges and details of the infrared image, multiplying the high-frequency components by a configured gain coefficient, and then superposing the result on the original image to obtain an infrared image with enhanced target edges and details;
and S103, processing the visible light image by using histogram equalization, augmenting the processed visible light image to increase the number of samples, and scaling the current visible light image based on a configured bilinear interpolation algorithm.
Specifically, due to atmospheric interference in the infrared imaging process, the radiation energy of the target is attenuated, resulting in an ultra-low signal-to-noise ratio of the infrared image. Since the edges and details of the target are helpful for detecting weak targets, an unsharp masking algorithm is used to enhance them: a differential operation between the infrared image and its low-pass-filtered blurred version yields the high-frequency components representing edges and details, which are multiplied by a configured gain coefficient and superposed on the original image.
Accordingly, for visible light images, histogram equalization is used to process the images; rotation, random cropping and similar augmentations are used to increase the number of samples; and bilinear interpolation is then used to scale the images to low-resolution images.
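By way of illustration, a minimal Python/OpenCV sketch of this preprocessing stage might look as follows; the Gaussian kernel size, gain coefficient, and output resolution are illustrative assumptions rather than values specified in the application:

```python
import cv2
import numpy as np

def enhance_infrared(ir_img: np.ndarray, gain: float = 1.5, ksize: int = 5) -> np.ndarray:
    """Unsharp-masking-style edge/detail enhancement of the infrared image.
    gain and ksize are illustrative; the application only speaks of a 'configured gain coefficient'."""
    blurred = cv2.GaussianBlur(ir_img, (ksize, ksize), 0)                 # low-pass-filtered (blurred) image
    high_freq = ir_img.astype(np.float32) - blurred.astype(np.float32)   # differential: edges and details
    enhanced = ir_img.astype(np.float32) + gain * high_freq              # superpose amplified high frequencies on the original
    return np.clip(enhanced, 0, 255).astype(np.uint8)

def preprocess_visible(vis_img: np.ndarray, out_size=(512, 512)) -> np.ndarray:
    """Histogram equalization followed by bilinear-interpolation scaling of the visible image.
    Rotation / random-crop augmentation would be applied separately during dataset construction."""
    ycrcb = cv2.cvtColor(vis_img, cv2.COLOR_BGR2YCrCb)
    ycrcb[..., 0] = cv2.equalizeHist(ycrcb[..., 0])                       # equalize the luminance channel
    equalized = cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)
    return cv2.resize(equalized, out_size, interpolation=cv2.INTER_LINEAR)  # bilinear scaling
```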
And S2, respectively extracting the image features of the infrared image and the visible light image by using the configured double-flow backbone network and carrying out feature fusion to obtain multi-scale image features.
In this embodiment, step S2 further includes:
S201, extracting features of the infrared image and the visible light image by using a configured ResNet network, wherein the extracted features comprise shallow features and high-level features;
S202, enhancing the spatial information of the target area corresponding to the image by using the configured spatial information guiding module, and guiding the shallow features of the two images to be fused;
S203, adopting at least one of an SE module or a coordinate attention module to strengthen the semantic information of the high-level features, adopting dilated convolutions with different dilation rates to capture the multi-scale context information of the two-mode high-level features, and splicing the multi-scale context information with the strengthened high-level features to obtain the enhanced high-level features of the two modes;
S204, fusing the current shallow features and high-level features to obtain multi-scale image features.
Specifically, in order to solve the problem of poor detection precision caused by the lack of information in infrared weak and small targets, and taking advantage of the abundant texture information in visible light images, a 'double-flow backbone network' is designed to perform feature extraction on the two modality data separately, and complementation of the two-mode features is realized through cross-modal multi-level inter-stage feature fusion, thereby improving the detection precision for weak and small targets. In the stage of fusing shallow and high-level features, a multi-scale semantic enhancement module is designed to enhance the semantic information of weak and small targets. Finally, the positioning and discrimination of weak and small targets are converted into regression and classification problems by designing a loss function.
1) Feature extraction
In this embodiment, referring to fig. 2, considering that ResNet-series networks can better fit the distribution function, obtain higher detection precision and facilitate the training and learning of the network, a ResNet network (such as ResNet34 or ResNet50) is used as the basic skeleton of the two-mode image feature extraction network, and a multi-receptive-field feature enhancement module and an attention mechanism module are additionally arranged on this skeleton to further strengthen the feature extraction capability of the backbone network.
Because the spatial distribution presented by weak and small targets in real scenes varies in size, and the information contained in a local area helps the detection of small targets, a multi-receptive-field feature enhancement module (shown in fig. 3) is designed to enrich the features of weak and small targets. The shallow features are fed into several residual modules with different convolution kernel sizes to obtain the corresponding multi-scale feature maps, which are then spliced in the channel dimension to realize multi-scale feature fusion. Although a larger convolution kernel can increase the receptive field of the network, there is a risk of mistakenly extracting the weak target as background; for this reason, the number of residual modules and the sizes of the convolution kernels are designed according to the spatial size occupied by the weak target in the image.
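A minimal PyTorch sketch of such a multi-receptive-field enhancement module might look as follows; the kernel sizes (1, 3, 5) and the final 1×1 fusion convolution are illustrative assumptions, since the application only states that the branch count and kernel sizes should match the spatial extent of the targets:

```python
import torch
import torch.nn as nn

class ResidualBranch(nn.Module):
    """One residual branch with a given convolution kernel size."""
    def __init__(self, channels: int, kernel_size: int):
        super().__init__()
        pad = kernel_size // 2
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size, padding=pad, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size, padding=pad, bias=False),
            nn.BatchNorm2d(channels),
        )
    def forward(self, x):
        return torch.relu(x + self.body(x))   # residual connection

class MultiReceptiveFieldEnhance(nn.Module):
    """Feeds the shallow feature through residual branches with different kernel sizes
    and splices the resulting multi-scale maps along the channel dimension."""
    def __init__(self, channels: int, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList([ResidualBranch(channels, k) for k in kernel_sizes])
        # 1x1 convolution to fuse the concatenated maps back to the original width (an assumption)
        self.fuse = nn.Conv2d(channels * len(kernel_sizes), channels, 1)
    def forward(self, x):
        multi_scale = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.fuse(multi_scale)
```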
To prevent the loss of important detail and texture information, a series of hybrid attention modules (as shown in fig. 4) is constructed with the aid of channel attention (CA) and spatial attention (SA). The hybrid attention module is added after the multi-receptive-field feature enhancement module to selectively focus on key feature information in space or channels while suppressing non-valuable features.
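Assuming a CBAM-style arrangement of channel attention followed by spatial attention (the application does not fix the exact ordering), one hedged sketch of the hybrid attention module is:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))
    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))            # average-pooled channel descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))             # max-pooled channel descriptor
        w = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * w

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w

class HybridAttention(nn.Module):
    """Channel attention followed by spatial attention, applied after the
    multi-receptive-field feature enhancement module."""
    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()
    def forward(self, x):
        return self.sa(self.ca(x))
```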
2) Cross-modal interlayer fusion
For features from the same shallow level of the two modality images, the channel dimension of the features can first be extended by a 1×1 convolution to enhance their expressive power; a spatial attention (SA) mechanism is then introduced to enhance the spatial information of the target region and to guide the fusion of the shallow features of the two modality images.
The spatial information guide (SAD) unit can be derived from the SAM module in CBAM: two spatial feature maps representing different information are generated by average pooling (AP) and maximum pooling (MP), combined, fused by a 7×7 convolution with a larger receptive field, and a weight map Ws is generated by a Sigmoid (σ) operation to output the corrected feature map. The SAM module can also be simplified so that only the AP or MP operation is adopted and the weight map Ws is generated by a K×K convolution with a larger receptive field, realizing the enhancement of the target areas of the two modality images:
Ws = σ(f^{K×K}(AP(F)))   or   Ws = σ(f^{K×K}(MP(F)))    (1)
In view of the fact that both modalities carry important shallow spatial information, a symmetric structure is adopted to realize cross-modal shallow feature fusion under SAD guidance, as shown in fig. 5 and formula (2).
In this fusion, the element-wise addition between the two modal features emphasizes the complementarity of the features, the element-wise multiplication emphasizes their commonality, and the BP units, consisting of BN (Batch Normalization) and an activation function (e.g., ReLU, PReLU or Swish), aim to improve the non-linearity of the shallow features through low-cost calculations.
Unlike shallow features, high-level features have rich semantic information and lack spatial information, so that in the high-level cross-modal fusion, a Channel Attention (CA) mechanism represented by an SE module or a coordinate attention module is adopted to enhance the semantic information of the high-level features. The reason for using the coordinate attention module (as shown in fig. 6) is that it can capture not only cross-channel information, but also direction-aware and position-sensitive information through an averaging pooling operation in both horizontal and vertical directions, which helps the model to more accurately locate and identify small objects.
In addition, considering that the context information of the target neighborhood is helpful for detecting small targets (for example, a bird is difficult to identify without sky information but is easy to distinguish when the sky serves as context), and that dilated (atrous) convolution can enhance semantic information by expanding the receptive field, dilated convolutions with several different dilation rates are adopted to capture the multi-scale context information of the two modal high-level features. In the CAD unit, this context information is spliced with the features enhanced by the CA mechanism, as shown in fig. 7 (a), to obtain the high-level semantic features of formula (3).
Further, similar to the shallow cross-modal fusion approach, a symmetric structure is also employed to achieve cross-modal fusion guided by high-level semantic information, as shown in fig. 7 (b) and formula (4).
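A hedged PyTorch sketch of one such CAD unit follows: coordinate attention enhances the high-level feature, parallel dilated convolutions capture multi-scale context, and the results are spliced along the channel dimension. The dilation rates (1, 2, 4), the reduction ratio, and the 1×1 fusion convolution are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Coordinate attention: pooling along H and W separately to keep positional information."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Sequential(nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)
    def forward(self, x):
        b, c, h, w = x.shape
        x_h = x.mean(dim=3, keepdim=True)                         # (b, c, h, 1): pooled along width
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)     # (b, c, w, 1): pooled along height
        y = self.conv1(torch.cat([x_h, x_w], dim=2))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                     # direction-aware weights along height
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2))) # direction-aware weights along width
        return x * a_h * a_w

class CADUnit(nn.Module):
    """Splices CA-enhanced high-level features with multi-scale context from dilated convolutions."""
    def __init__(self, channels: int, dilations=(1, 2, 4)):
        super().__init__()
        self.ca = CoordinateAttention(channels)
        self.contexts = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=d, dilation=d) for d in dilations])
        self.fuse = nn.Conv2d(channels * (len(dilations) + 1), channels, 1)
    def forward(self, x):
        feats = [self.ca(x)] + [ctx(x) for ctx in self.contexts]
        return self.fuse(torch.cat(feats, dim=1))
```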
3) Multi-scale feature fusion network
In order to realize the fusion of shallow spatial information and high-level semantic information, and based on a feature pyramid (FPN) structure, a semantic enhancement module (CEM) is embedded in the high-level feature path, exploiting the abundant semantic information contained in high-level features, while a spatial enhancement module (SEM) with a super-resolution function is embedded in the shallow feature path, exploiting the abundant spatial information contained in shallow features (as shown in fig. 8). This alleviates the problem that the spatial and semantic information of weak and small targets is easily submerged, and improves the detection precision.
In view of the fact that the context information of the area surrounding the target can promote the detection of weak and small targets, a semantic enhancement module (CEM) is embedded between the high-level C5, which has strong semantic information, and P5 in the FPN. Multi-scale context information of different receptive fields is obtained by adopting dilated convolutions with different dilation rates r, and can be fused in various ways: as shown in fig. 9 (a), after C5 is encoded in parallel by dilated convolutions with different r, an addition fusion mode is adopted; or a splicing fusion mode is adopted, as shown in fig. 9 (b); or the serial-parallel cascade fusion mode of fig. 9 (c) is adopted, in which semantic information with a smaller field of view is gradually transferred to semantic information with a larger field of view through the serial path. Skip connections are used so that, on the one hand, the interaction between the output features before and after each level of dilated convolution is enhanced, strengthening the feature fusion, and on the other hand, the computational complexity is reduced and convergence and inference are accelerated.
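As an illustration of the simplest of these variants (the addition fusion of fig. 9 (a)), a hedged PyTorch sketch of the CEM might be; the dilation rates are assumptions:

```python
import torch.nn as nn

class SemanticEnhanceModule(nn.Module):
    """CEM between C5 and P5: parallel dilated convolutions with different rates,
    fused here by addition (the fig. 9(a) variant); rates are illustrative."""
    def __init__(self, channels: int, dilations=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=d, dilation=d),
                          nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
            for d in dilations])
    def forward(self, c5):
        out = c5
        for branch in self.branches:
            out = out + branch(c5)    # addition fusion of multi-receptive-field context
        return out
```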
The SEM module is designed to generate high-resolution features from low-resolution images to support the detection of small target objects while maintaining a low computational cost. The shallow feature P_2 of the pyramid (FPN) (see fig. 10), which contains rich spatial information, is used as a reference and is mixed with the adjacent higher-level feature P_3 in the SEM module to generate an intermediate feature P_3' that is beneficial to the detection of weak and small objects, as expressed in formula (5), in which the two branches represent texture and semantic extraction operations respectively and ↑2× represents an upsampling operation implemented with sub-pixel convolution. Considering that the smaller receptive field of the shallow feature map of the pyramid (FPN) helps to better locate weak and small objects, after the intermediate feature P_3' is generated, it is superimposed with the high-resolution feature map of the same level as P_2, generating a pyramid layer P_2' with enhanced spatial information, which is used for the subsequent positioning and classification of weak and small targets, as expressed in formula (6).
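Because formulas (5) and (6) are not reproduced in the text above, the following sketch encodes one plausible reading of the description: a texture branch on P_2, a semantic branch with sub-pixel (PixelShuffle) 2× upsampling on P_3, and additive mixing for both P_3' and P_2'. The additive combinations and the single-convolution branches are assumptions.

```python
import torch.nn as nn

class SpatialEnhanceModule(nn.Module):
    """SEM: builds a high-resolution, spatially enhanced pyramid level from P2 and P3."""
    def __init__(self, channels: int):
        super().__init__()
        self.texture = nn.Conv2d(channels, channels, 3, padding=1)   # texture extraction on P2 (assumption)
        self.semantic = nn.Sequential(                               # semantic extraction + sub-pixel 2x upsampling of P3
            nn.Conv2d(channels, channels * 4, 3, padding=1),
            nn.PixelShuffle(2))
    def forward(self, p2, p3):
        p3_mid = self.texture(p2) + self.semantic(p3)   # intermediate feature P3' (formula (5), assumed additive)
        return p3_mid + p2                              # spatially enhanced level P2' (formula (6), assumed additive)
```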
and step S3, based on the multi-scale image characteristics, acquiring the position information and the classification result of the detection target by using a configured positioning classification model.
In this embodiment, step S3 further includes:
s301, acquiring the high-level image features;
s302, acquiring position information and a classification result of a detection target by using a configured positioning classification model, wherein the positioning classification model is determined based on two parts of contents of an RPN and a head structure.
Specifically, referring to fig. 11, considering that the model is to be deployed on mobile/embedded or edge devices, there are high requirements on the detection speed. Therefore, based on the feature maps obtained in the feature extraction and feature fusion stages, three structural layers of Faster R-CNN (Faster Regions with CNN Features) are introduced in the positioning and classification stage: the RPN (Region Proposal Network), RoI Pooling (Region of Interest Pooling), and classification and regression. The RPN uses an anchor mechanism to correlate region generation with the convolutional network, preliminarily screening candidate boxes containing targets and judging foreground and background; the classification and regression layer then outputs the final predicted bounding box and the target category.
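A hedged sketch of a standard RPN head over the fused multi-scale feature maps is given below; the anchor count, the shared 3×3 convolution, and the omission of proposal filtering and RoI pooling are simplifications, not details taken from the application:

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Standard RPN head: a shared 3x3 convolution, then 1x1 convolutions predicting
    per-anchor objectness scores and bounding-box regression deltas."""
    def __init__(self, in_channels: int, num_anchors: int = 9):
        super().__init__()
        self.shared = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.objectness = nn.Conv2d(in_channels, num_anchors, 1)        # foreground/background score per anchor
        self.bbox_deltas = nn.Conv2d(in_channels, num_anchors * 4, 1)   # parameterized box coordinates per anchor
    def forward(self, features):
        # features: list of multi-scale fused feature maps from the double-flow backbone
        scores, deltas = [], []
        for f in features:
            t = torch.relu(self.shared(f))
            scores.append(self.objectness(t))
            deltas.append(self.bbox_deltas(t))
        return scores, deltas
```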
The target detection task is required not only to complete the classification of the target but also to complete the localization of the target object. The loss function of the localization and classification model contains two branches: the RPN and the head structure. The loss function of the RPN is designed as:
L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ · (1/N_reg) Σ_i p_i* · L_reg(t_i, t_i*)    (7)
where p_i represents the predicted probability that the i-th anchor box is foreground; p_i* is the ground-truth label, which is 1 when the anchor box is a positive sample and 0 when it is a negative sample; t_i represents the vector of parameterized coordinates of the predicted bounding box of the i-th anchor box; t_i* represents the ground-truth bounding box corresponding to the i-th positive anchor box; N_cls is the mini-batch size, which can be set to 3 in this formula; N_reg represents the number of anchor positions, which can be set to 10 in this formula; and λ is a balance parameter, which can be set to 10 in this formula.
The classification loss function of the RPN is:
L_cls(p_i, p_i*) = −log[ p_i* · p_i + (1 − p_i*) · (1 − p_i) ]    (8)
The regression loss function of the RPN is:
L_reg(t_i, t_i*) = Σ_{j∈{x,y,w,h}} smooth_L1(t_{i,j} − t*_{i,j})    (9)
smooth_L1(x) = 0.5·x²  if |x| < 1;  |x| − 0.5  otherwise    (10)
In the head structure, the application introduces the mean square error (MSE) as the classification loss function of the head structure:
L_cls_head = (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)²    (11)
where y_i is the true class label and ŷ_i is the predicted value.
The four loss functions from the RPN and the head structure are integrated to design the total loss function for weak and small target detection, as given in formula (12).
In summary, the detection of infrared weak and small targets can be roughly divided into the above three steps: first, image preprocessing is performed on the infrared image and the visible light image respectively; then, the features of the two modality images are extracted through the double-flow backbone network and fused to obtain high-level features; finally, the detection and classification of the target are realized through the positioning/classification model. The flow chart is shown in fig. 12.
It should be noted that, the infrared weak and small target detection algorithm can be used after training, and the corresponding algorithm training flow is as follows:
1) Input: dataset = (input image set, classification label set); model training parameters = (number of training iterations, initial learning rate).
2) Dataset processing: augmenting the dataset; dividing it into a training set, a validation set and a test set; and labeling the training set to generate an annotation set.
3) Model training: the Adam optimization algorithm is adopted to adaptively adjust the learning rate; the training stopping condition is determined according to the loss function (Focal Loss); the weight set is output after network training; finally, the trained weights are imported and the infrared image to be detected is input.
4) Output: the position information and classification results of the infrared weak and small targets.
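The training flow above can be sketched, under stated assumptions, as follows: the Adam optimizer and a focal-loss classification term reflect the flow's description, while the dataset interface, batch size, and the assumption that the model returns a dict of loss terms are placeholders for illustration only.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss for a classification branch (alpha/gamma are common defaults, not values from the application)."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def train(model, train_set, epochs=50, lr=1e-3, device="cuda"):
    """Adam-based training loop; `model` is assumed to take (ir, vis, targets) and return a dict of loss terms."""
    loader = DataLoader(train_set, batch_size=4, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device).train()
    for epoch in range(epochs):
        for ir_img, vis_img, targets in loader:
            losses = model(ir_img.to(device), vis_img.to(device), targets)
            total = sum(losses.values())          # RPN + head, classification + regression terms
            optimizer.zero_grad()
            total.backward()
            optimizer.step()
    torch.save(model.state_dict(), "weights.pth")  # output the trained weight set
```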
Compared with the prior art, the embodiment has at least the following advantages:
according to the method provided by the embodiment, feature extraction is achieved on two-mode data of infrared light and visible light through the double-flow backbone network model, complementation of the two-mode features is achieved through cross-mode multi-level inter-stage feature fusion, positioning and classification of weak and small targets are achieved through a deep learning-based network, and detection accuracy of the assisted weak and small targets is improved.
The second embodiment of the present application, corresponding to the first embodiment, with reference to fig. 13, introduces a weak and small target detection device with integrated infrared and visible light features, and includes the following components:
the preprocessing unit is configured to respectively preprocess an infrared image and a visible light image, wherein the infrared image and the visible light image comprise at least one detection target;
the fusion unit is configured to extract image features of the infrared image and the visible light image respectively by using the configured double-flow backbone network and perform feature fusion so as to obtain multi-scale image features;
and the detection unit is configured to acquire the position information and the classification result of the detection target by using a configuration positioning classification model based on the multi-scale image characteristics.
In this embodiment, the preprocessing unit is further configured to:
acquiring the infrared image and the visible light image;
performing a differential operation between the infrared image and its low-pass-filtered blurred image based on a configured unsharp masking edge and detail enhancement algorithm to obtain the high-frequency components representing the edges and details of the infrared image, multiplying the high-frequency components by a configured gain coefficient, and then superposing the result on the original image to obtain an infrared image with enhanced target edges and details;
and processing the visible light image by using histogram equalization, augmenting the processed visible light image to increase the number of samples, and scaling the current visible light image based on a configured bilinear interpolation algorithm.
In this embodiment, the fusion unit is further configured to:
extracting features of the infrared image and the visible light image by using a configured ResNet network, wherein the extracted features comprise shallow features and high-level features;
the spatial information of the target area corresponding to the image is enhanced by using the configured spatial information guiding module, and shallow layer characteristics of the two images are guided to be fused;
adopting at least one of an SE module or a coordinate attention module to strengthen the semantic information of the high-level features, adopting dilated convolutions with different dilation rates to capture the multi-scale context information of the two-mode high-level features, and splicing the multi-scale context information with the strengthened high-level features to obtain the enhanced high-level features of the two modes;
and fusing the current shallow layer characteristics and the high layer characteristics to obtain multi-scale image characteristics.
In this embodiment, the detection unit is further configured to:
acquiring the high-level image characteristics;
and acquiring the position information and the classification result of the detection target by using a configured positioning classification model, wherein the positioning classification model is determined based on two parts of contents of the RPN and the head structure.
A third embodiment of the present application, as shown in fig. 14, is an electronic device, which can be understood as a physical device, including a processor and a memory storing processor-executable instructions, which when executed by the processor, perform the following operations:
step S1, respectively preprocessing an infrared image and a visible light image, wherein the infrared image and the visible light image comprise at least one detection target;
step S2, respectively extracting image features of the infrared image and the visible light image by using the configured double-flow backbone network and carrying out feature fusion to obtain multi-scale image features;
and step S3, based on the multi-scale image characteristics, acquiring the position information and the classification result of the detection target by using a configured positioning classification model.
In the fourth embodiment of the present application, the flow of the method for detecting a weak and small target by combining infrared and visible light features of the present embodiment is the same as that of the first, second or third embodiment, and the difference is that in engineering implementation, the present embodiment may be implemented by means of software plus a necessary general hardware platform, and certainly may also be implemented by hardware, but in many cases, the former is a better implementation. Based on such understanding, the method of the present application may be embodied in the form of a computer software product stored on a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) comprising instructions for causing an apparatus to perform the method of the embodiments of the present application.
While the application has been described in connection with specific embodiments thereof, it is to be understood that various modifications falling within the spirit and scope of the application are included, and the application is not to be limited thereto.

Claims (10)

1. The method for detecting the weak and small target by combining the infrared and visible light features is characterized by comprising the following steps of:
step S1, respectively preprocessing an infrared image and a visible light image, wherein the infrared image and the visible light image comprise at least one detection target;
step S2, respectively extracting image features of the infrared image and the visible light image by using the configured double-flow backbone network and carrying out feature fusion to obtain multi-scale image features;
and step S3, based on the multi-scale image characteristics, acquiring the position information and the classification result of the detection target by using a configured positioning classification model.
2. The method for detecting a small target by fusion of infrared and visible light features according to claim 1, wherein the step S1 comprises:
acquiring the infrared image and the visible light image;
performing a differential operation between the infrared image and its low-pass-filtered blurred image based on a configured unsharp masking edge and detail enhancement algorithm to obtain the high-frequency components representing the edges and details of the infrared image, multiplying the high-frequency components by a configured gain coefficient, and then superposing the result on the original image to obtain an infrared image with enhanced target edges and details;
and processing the visible light image by using histogram equalization, augmenting the processed visible light image to increase the number of samples, and scaling the current visible light image based on a configured bilinear interpolation algorithm.
3. The method for detecting a small target by fusion of infrared and visible light features according to claim 2, wherein the step S2 comprises:
extracting features of the infrared image and the visible light image by using a configured ResNet network, wherein the extracted features comprise shallow features and high-level features;
the spatial information of the target area corresponding to the image is enhanced by using the configured spatial information guiding module, and shallow layer characteristics of the two images are guided to be fused;
adopting at least one of an SE module or a coordinate attention module to strengthen the semantic information of the high-level features, adopting dilated convolutions with different dilation rates to capture the multi-scale context information of the two-mode high-level features, and splicing the multi-scale context information with the strengthened high-level features to obtain the enhanced high-level features of the two modes;
and fusing the current shallow layer characteristics and the high layer characteristics to obtain multi-scale image characteristics.
4. The method for detecting a small target by combining infrared and visible light features according to claim 3, wherein the step S3 comprises:
acquiring the high-level image characteristics;
and acquiring the position information and the classification result of the detection target by using a configured positioning classification model, wherein the positioning classification model is determined based on two parts of contents of the RPN and the head structure.
5. An infrared and visible light characteristic fusion dim target detection device, characterized by comprising:
the preprocessing unit is configured to respectively preprocess an infrared image and a visible light image, wherein the infrared image and the visible light image comprise at least one detection target;
the fusion unit is configured to extract image features of the infrared image and the visible light image respectively by using the configured double-flow backbone network and perform feature fusion so as to obtain multi-scale image features;
and the detection unit is configured to acquire the position information and the classification result of the detection target by using a configuration positioning classification model based on the multi-scale image characteristics.
6. The dim target detection device according to claim 5, wherein the preprocessing unit is further configured to:
acquiring the infrared image and the visible light image;
performing a differential operation between the infrared image and its low-pass-filtered blurred image based on a configured unsharp masking edge and detail enhancement algorithm to obtain the high-frequency components representing the edges and details of the infrared image, multiplying the high-frequency components by a configured gain coefficient, and then superposing the result on the original image to obtain an infrared image with enhanced target edges and details;
and processing the visible light image by using histogram equalization, augmenting the processed visible light image to increase the number of samples, and scaling the current visible light image based on a configured bilinear interpolation algorithm.
7. The dim target detection device according to claim 5, wherein the fusion unit is further configured to:
extracting features of the infrared image and the visible light image by using a configured ResNet network, wherein the extracted features comprise shallow features and high-level features;
the spatial information of the target area corresponding to the image is enhanced by using the configured spatial information guiding module, and shallow layer characteristics of the two images are guided to be fused;
adopting at least one of an SE module or a coordinate attention module to strengthen the semantic information of the high-level features, adopting dilated convolutions with different dilation rates to capture the multi-scale context information of the two-mode high-level features, and splicing the multi-scale context information with the strengthened high-level features to obtain the enhanced high-level features of the two modes;
and fusing the current shallow layer characteristics and the high layer characteristics to obtain multi-scale image characteristics.
8. The dim target detection device according to claim 5, wherein the detection unit is further configured to:
acquiring the high-level image characteristics;
and acquiring the position information and the classification result of the detection target by using a configured positioning classification model, wherein the positioning classification model is determined based on two parts of contents of the RPN and the head structure.
9. An electronic device, the electronic device comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, which when executed by the processor performs the steps of the method for detecting a dim target with fusion of infrared and visible light features as claimed in any one of claims 1 to 4.
10. A computer storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method for detecting a small target of fusion of infrared and visible light features as claimed in any one of claims 1 to 4.
CN202310813236.6A 2023-07-05 2023-07-05 Method and device for detecting weak and small targets by combining infrared and visible light characteristics Pending CN116958782A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310813236.6A CN116958782A (en) 2023-07-05 2023-07-05 Method and device for detecting weak and small targets by combining infrared and visible light characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310813236.6A CN116958782A (en) 2023-07-05 2023-07-05 Method and device for detecting weak and small targets by combining infrared and visible light characteristics

Publications (1)

Publication Number Publication Date
CN116958782A true CN116958782A (en) 2023-10-27

Family

ID=88461207

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310813236.6A Pending CN116958782A (en) 2023-07-05 2023-07-05 Method and device for detecting weak and small targets by combining infrared and visible light characteristics

Country Status (1)

Country Link
CN (1) CN116958782A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117726958A (en) * 2024-02-07 2024-03-19 国网湖北省电力有限公司 Intelligent detection and hidden danger identification method for inspection image target of unmanned aerial vehicle of distribution line
CN117726958B (en) * 2024-02-07 2024-05-10 国网湖北省电力有限公司 Intelligent detection and hidden danger identification method for inspection image target of unmanned aerial vehicle of distribution line
CN117974960A (en) * 2024-03-28 2024-05-03 临沂大学 Double-light-fusion dynamic super-resolution layered sensing method
CN117974960B (en) * 2024-03-28 2024-06-18 临沂大学 Double-light-fusion dynamic super-resolution layered sensing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination