CN118155036A - Image detection method and device based on multispectral image fusion, medium and equipment - Google Patents

Image detection method and device based on multispectral image fusion, medium and equipment

Info

Publication number
CN118155036A
Authority
CN
China
Prior art keywords
fusion
feature
features
attention
matrix
Prior art date
Legal status
Granted
Application number
CN202410564425.9A
Other languages
Chinese (zh)
Other versions
CN118155036B (en)
Inventor
张文宇
李勇智
胡伟
魏强
王程英
鲁雪平
张勋
黄梓瑞
Current Assignee
Xi'an Ordnance Industry Technology Industry Development Co ltd
Original Assignee
Xi'an Ordnance Industry Technology Industry Development Co ltd
Priority date
Filing date
Publication date
Application filed by Xi'an Ordnance Industry Technology Industry Development Co ltd
Priority to CN202410564425.9A
Publication of CN118155036A
Application granted
Publication of CN118155036B
Legal status: Active
Anticipated expiration

Landscapes

  • Image Analysis (AREA)

Abstract

The application relates to the technical field of multispectral cross-modal fusion, and in particular discloses an image detection method and device, medium and equipment based on multispectral image fusion. The method comprises the following steps: acquiring an RGB image and a multispectral image corresponding to a target area, and performing feature extraction on the RGB image and the multispectral image respectively to obtain a first extracted feature and a second extracted feature; inputting the first extracted feature and the second extracted feature into a target fusion module, and performing feature processing through the target fusion module to obtain a first fusion feature corresponding to the RGB image and a second fusion feature corresponding to the multispectral image; performing feature fusion on the first fusion feature corresponding to the RGB image and the second fusion feature corresponding to the multispectral image to obtain a target fusion feature, and constructing a target fusion image corresponding to the target area according to the target fusion feature; and inputting the target fusion image into an image detection model to obtain a detection result for the multispectral object.

Description

Image detection method and device based on multispectral image fusion, medium and equipment
Technical Field
The application relates to the technical field of multispectral cross-modal fusion, in particular to an image detection method, device, medium and equipment based on multispectral image fusion.
Background
In detecting real objects, the environment in which the object is located is often open and dynamic, so models and algorithms are required to cope with the challenges presented by such an open environment, for example rain, fog, occlusion, illumination differences and low resolution. Under these conditions, if an algorithm detects objects using only image data from the visible band, the detection effect is often unsatisfactory. Against this background, multispectral imaging techniques have developed. Multispectral image data can provide combined information across multiple spectral bands, such as visible light information and infrared information, and therefore has significant advantages over images that reflect only visible light information or only infrared information. However, when detecting objects with multispectral images, the inherent properties of the different modalities need to be fully exploited, so a good cross-modal fusion mechanism becomes critical.
In the prior art, when a multispectral image and an RGB image are fused, feature extraction is generally performed on each image separately to obtain the extracted features corresponding to the multispectral image and to the RGB image; the two sets of extracted features are then spliced, and a fused image corresponding to the multispectral image and the RGB image is obtained from the splicing result. However, the fused image obtained in this way is of poor quality, which directly degrades the detection accuracy when the fused image is subsequently used for multispectral object detection.
Disclosure of Invention
In view of the above, the application provides an image detection method, device, medium and equipment based on multispectral image fusion. Two paths of input features are further processed by a joint self-attention unit in each fusion sub-module to obtain spectral correlation features and spatial similarity features, which are then fused into a spectral-spatial fusion feature, realizing a preliminary fusion of the multi-source images. A deep cross-modal fusion unit further performs cross-modal learning on the spectral-spatial fusion feature; it can naturally perform intra-modal and inter-modal fusion at the same time and deeply learn the latent relations between the multi-source images, so that the feature information mined from the two images is more complete. The cross-modal fusion features obtained by cross-modal learning are added as supplementary information to the original modal branches, and the target fusion feature is finally obtained after subsequent operations, so that the fusion effect is greatly improved and the detection performance for subsequent multispectral objects is improved.
According to one aspect of the present application, there is provided an image detection method based on multispectral image fusion, comprising:
Acquiring an RGB image and a multispectral image corresponding to a target area, and respectively carrying out feature extraction on the RGB image and the multispectral image to obtain a first extraction feature and a second extraction feature;
Inputting the first extracted features and the second extracted features into a target fusion module, and performing feature processing through the target fusion module to obtain first fusion features corresponding to the RGB images and second fusion features corresponding to the multispectral images, wherein the target fusion module comprises at least one fusion sub-module, and each fusion sub-module comprises a joint self-attention unit, a deep cross-mode fusion unit, a first feature fusion unit and a second feature fusion unit; the combined self-attention unit obtains spectrum correlation characteristics and space similarity characteristics according to two paths of input characteristics, and performs characteristic fusion on the spectrum correlation characteristics and the space similarity characteristics to obtain spectrum-space fusion characteristics; the deep cross-modal fusion unit performs cross-modal learning on the spectrum-space fusion feature to obtain a cross-modal fusion feature, and obtains a first supplementary feature of the first extraction feature and a second supplementary feature of the second extraction feature according to the cross-modal fusion feature; the first feature fusion unit obtains a first fusion feature according to the first extraction feature and the first supplementary feature; the second feature fusion unit obtains a second fusion feature according to the second extraction feature and the second supplementary feature;
Obtaining target fusion characteristics according to the first fusion characteristics corresponding to the RGB image and the second fusion characteristics corresponding to the multispectral image, and constructing a target fusion image corresponding to the target area according to the target fusion characteristics;
And inputting the target fusion image into an image detection model to obtain a detection result of the multispectral object.
According to another aspect of the present application, there is provided an image detection apparatus based on multispectral image fusion, comprising:
The image acquisition module is used for acquiring an RGB image and a multispectral image corresponding to the target area, and respectively extracting the characteristics of the RGB image and the multispectral image to obtain a first extracted characteristic and a second extracted characteristic;
The feature processing module is used for inputting the first extracted features and the second extracted features to a target fusion module, performing feature processing through the target fusion module to obtain first fusion features corresponding to the RGB images and second fusion features corresponding to the multispectral images, wherein the target fusion module comprises at least one fusion sub-module, and each fusion sub-module comprises a joint self-attention unit, a deep cross-modal fusion unit, a first feature fusion unit and a second feature fusion unit; the combined self-attention unit obtains spectrum correlation characteristics and space similarity characteristics according to two paths of input characteristics, and performs characteristic fusion on the spectrum correlation characteristics and the space similarity characteristics to obtain spectrum-space fusion characteristics; the deep cross-modal fusion unit performs cross-modal learning on the spectrum-space fusion feature to obtain a cross-modal fusion feature, and obtains a first supplementary feature of the first extraction feature and a second supplementary feature of the second extraction feature according to the cross-modal fusion feature; the first feature fusion unit obtains a first fusion feature according to the first extraction feature and the first supplementary feature; the second feature fusion unit obtains a second fusion feature according to the second extraction feature and the second supplementary feature;
The image construction module is used for obtaining target fusion characteristics according to the first fusion characteristics corresponding to the RGB image and the second fusion characteristics corresponding to the multispectral image, and constructing a target fusion image corresponding to the target area according to the target fusion characteristics;
and the detection module is used for inputting the target fusion image into an image detection model to obtain a detection result of the multispectral object.
According to still another aspect of the present application, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described image detection method based on multispectral image fusion.
According to still another aspect of the present application, there is provided a computer apparatus including a storage medium, a processor and a computer program stored on the storage medium and executable on the processor, the processor implementing the above image detection method based on multispectral image fusion when executing the program.
By means of the technical scheme, the image detection method, the device, the medium and the equipment based on multispectral image fusion can acquire RGB images and multispectral images aiming at the same target area, and respectively perform feature extraction on the two images to obtain a first extraction feature and a second extraction feature. And then, performing feature processing on the first extracted features and the second extracted features by using a target fusion module, and finally outputting first fusion features corresponding to the RGB images and second fusion features corresponding to the multispectral images. Specifically, the target fusion module may include one fusion sub-module or may include a plurality of fusion sub-modules. Each fusion sub-module may include a joint self-attention unit, a deep cross-modality fusion unit, a first feature fusion unit, and a second feature fusion unit. The combined self-attention unit can process two paths of input features to obtain spectrum correlation features and space similarity features, and perform feature fusion on the spectrum correlation features and the space similarity features to obtain spectrum-space fusion features. The deep cross-modal fusion unit can perform cross-modal learning on the spectrum-space fusion characteristics to obtain cross-modal fusion characteristics, and perform inverse operation on the cross-modal fusion characteristics to obtain first supplementary characteristics and second supplementary characteristics. Then, the first feature fusion unit may perform an addition operation on the first extracted feature and the first supplementary feature to obtain a first fusion feature; and meanwhile, the second feature fusion unit can perform addition operation on the second extracted feature and the second supplementary feature to obtain a second fusion feature. After the first fusion feature and the second fusion feature are obtained, the first fusion feature and the second fusion feature can be added, so that the target fusion feature is obtained. Finally, image reconstruction can be performed by utilizing the target fusion characteristics, a target fusion image corresponding to the target region is finally obtained, and then the multispectral object is detected through the target fusion image. According to the embodiment of the application, the two paths of input features are further extracted through the combined self-attention unit in each fusion sub-module to obtain the spectrum correlation features and the space similarity features, and then the spectrum correlation features and the space similarity features are subjected to feature fusion to obtain spectrum-space fusion features, so that the primary fusion of the multi-source images is realized; the deep cross-modal fusion unit further carries out cross-modal learning on spectrum-space fusion characteristics, can naturally and simultaneously carry out intra-modal and inter-modal fusion, and deeply learns potential relations among multi-source images, so that characteristic information mined from two images is more sufficient; and adding the cross-modal fusion characteristics obtained by cross-modal learning as supplementary information to an original modal branch, and finally obtaining target fusion characteristics after subsequent operation, so that the fusion effect is greatly improved, and the detection performance of subsequent multispectral objects is improved.
The foregoing description is only an overview of the technical solution of the present application. In order that the technical means of the present application may be more clearly understood and implemented in accordance with the contents of the specification, and in order that the above and other objects, features and advantages of the present application may become more readily apparent, specific embodiments of the application are set forth below.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
Fig. 1 shows a schematic flow chart of an image detection method based on multispectral image fusion according to an embodiment of the application;
Fig. 2 is a schematic flow chart of another image detection method based on multispectral image fusion according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a target fusion module according to an embodiment of the present application;
Fig. 4 shows a schematic structural diagram of a fusion sub-module according to an embodiment of the present application;
Fig. 5 is a schematic diagram of a combined self-attention unit according to an embodiment of the present application;
Fig. 6 is a schematic diagram of a spectrum attention unit according to an embodiment of the present application;
Fig. 7 is a schematic diagram of a channel-space attention unit according to an embodiment of the present application;
Fig. 8 illustrates a schematic structural diagram of a deep cross-modal fusion unit according to an embodiment of the present application;
Fig. 9 is a schematic diagram of still another image detection method based on multispectral image fusion according to an embodiment of the present application;
Fig. 10 shows a schematic structural diagram of an image detection device based on multispectral image fusion according to an embodiment of the present application;
Fig. 11 shows a schematic device structure of a computer device according to an embodiment of the present application.
Detailed Description
The application will be described in detail hereinafter with reference to the drawings in conjunction with embodiments. It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other.
In this embodiment, an image detection method based on multispectral image fusion is provided, as shown in fig. 1, and the method includes:
Step 101, an RGB image and a multispectral image corresponding to a target area are obtained, and feature extraction is carried out on the RGB image and the multispectral image respectively to obtain a first extraction feature and a second extraction feature.
The image detection method based on multispectral image fusion provided by the embodiment of the application can improve the detection accuracy of multispectral objects. The application fuses a conventional RGB image with the corresponding multispectral image, and fully considers the spatial self-similarity information in the RGB image and the spectral correlation information of the multispectral image during fusion, thereby improving the fusion quality of the final image. First, an RGB image and a multispectral image of the same target area may be acquired. Specifically, in order to ensure that the regions to which the RGB image and the multispectral image correspond completely coincide, the RGB image and the multispectral image may be subjected to registration processing. Then, feature extraction may be performed on the RGB image to obtain a first extracted feature, and at the same time feature extraction may be performed on the multispectral image to obtain a second extracted feature. Feature extraction may be carried out using convolution layers or similar operations. It should be noted that, after feature extraction, the obtained first extracted feature and second extracted feature have the same dimensions.
102, Inputting the first extracted feature and the second extracted feature to a target fusion module, and performing feature processing through the target fusion module to obtain a first fusion feature corresponding to the RGB image and a second fusion feature corresponding to the multispectral image, wherein the target fusion module comprises at least one fusion sub-module, and each fusion sub-module comprises a joint self-attention unit, a deep cross-modal fusion unit, a first feature fusion unit and a second feature fusion unit; the combined self-attention unit obtains spectrum correlation characteristics and space similarity characteristics according to two paths of input characteristics, and performs characteristic fusion on the spectrum correlation characteristics and the space similarity characteristics to obtain spectrum-space fusion characteristics; the deep cross-modal fusion unit performs cross-modal learning on the spectrum-space fusion feature to obtain a cross-modal fusion feature, and obtains a first supplementary feature of the first extraction feature and a second supplementary feature of the second extraction feature according to the cross-modal fusion feature; the first feature fusion unit obtains a first fusion feature according to the first extraction feature and the first supplementary feature; and the second feature fusion unit obtains a second fusion feature according to the second extraction feature and the second supplementary feature.
In the embodiment of the present application, optionally, when the target fusion module includes a plurality of fusion sub-modules, each of the fusion sub-modules is connected in sequence, and the output of the previous fusion sub-module is the input of the next adjacent fusion sub-module; when any fusion sub-module is the first fusion sub-module in the target fusion module, the two paths of input features of any fusion sub-module are the first extraction feature and the second extraction feature respectively, otherwise, the two paths of input features of any fusion sub-module are the first fusion feature and the second fusion feature output by the previous adjacent fusion sub-module respectively.
In this embodiment, after the first extraction feature and the second extraction feature are obtained, the target fusion module may be used to perform feature processing on the first extraction feature and the second extraction feature, and finally output a first fusion feature corresponding to the RGB image and a second fusion feature corresponding to the multispectral image. Specifically, the target fusion module may include one fusion sub-module or may include a plurality of fusion sub-modules. When one fusion sub-module is included, the inputs of the fusion sub-module are the first extraction feature and the second extraction feature, and the outputs are the first fusion feature corresponding to the RGB image and the second fusion feature corresponding to the multispectral image. When a plurality of fusion sub-modules are included, the fusion sub-modules are connected in sequence; for example, with fusion sub-module 1, fusion sub-module 2 and fusion sub-module 3, fusion sub-module 1 is connected to fusion sub-module 2, and fusion sub-module 2 is connected to fusion sub-module 3. Each fusion sub-module has two paths of input features. For the first of the fusion sub-modules, the two paths of input features are the first extraction feature and the second extraction feature respectively, and the outputs are the first fusion feature and the second fusion feature; for each of the remaining fusion sub-modules, the two paths of input features are the outputs of the preceding adjacent fusion sub-module, namely the first fusion feature and the second fusion feature output by that preceding sub-module, and the outputs are the updated first fusion feature and the updated second fusion feature. The updated first fusion feature output by the last fusion sub-module is taken as the first fusion feature corresponding to the RGB image, and the updated second fusion feature output by the last fusion sub-module is taken as the second fusion feature corresponding to the multispectral image.
In particular, each fusion sub-module may include a joint self-attention unit, a deep cross-modality fusion unit, a first feature fusion unit, and a second feature fusion unit. The combined self-attention unit can perform fusion processing on two paths of input features to obtain spectrum-space fusion features. Here, the joint self-attention unit may make full use of the spectral correlation and the spatial self-similarity of the multi-source image, thereby enabling the spectral-spatial fusion feature to contain more spectral information and spatial information. Then, the deep cross-modal fusion unit can perform cross-modal learning on the spectrum-space fusion characteristics to obtain cross-modal fusion characteristics corresponding to the spectrum-space fusion characteristics, and prior knowledge in related RGB images and multispectral images can be learned through cross-modal learning. The deep cross-modal fusion unit can naturally and simultaneously perform intra-modal and inter-modal fusion, can effectively utilize self-similarity of images, and has advantages compared with convolution operation in modeling long-distance dependency. In addition, the deep cross-modal fusion unit can also perform inverse operation on the cross-modal fusion characteristics to obtain a first supplementary characteristic and a second supplementary characteristic. The first supplementary feature is used for supplementing information to the first extracted feature, and the second supplementary feature is used for supplementing information to the second extracted feature. Then, the first feature fusion unit may perform an addition operation on the first extracted feature and the first supplementary feature to obtain a first fusion feature; and meanwhile, the second feature fusion unit can perform addition operation on the second extracted feature and the second supplementary feature to obtain a second fusion feature.
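To make the data flow concrete, a rough PyTorch sketch is given below (all class and variable names are illustrative assumptions; the joint self-attention and deep cross-modal fusion units are passed in as modules whose internal structure is described in later sections):

```python
import torch
import torch.nn as nn

class FusionSubmodule(nn.Module):
    """One fusion sub-module: joint self-attention -> deep cross-modal fusion
    -> per-branch additive fusion (names are illustrative)."""
    def __init__(self, joint_attention: nn.Module, cross_modal: nn.Module):
        super().__init__()
        self.joint_attention = joint_attention   # yields the spectral-spatial fusion feature
        self.cross_modal = cross_modal           # yields the two supplementary features

    def forward(self, f_rgb, f_ms):
        spectral_spatial = self.joint_attention(f_rgb, f_ms)
        supp_rgb, supp_ms = self.cross_modal(spectral_spatial)
        # first/second feature fusion units: element-wise addition with the branch inputs
        return f_rgb + supp_rgb, f_ms + supp_ms

class TargetFusionModule(nn.Module):
    """p fusion sub-modules connected in sequence; the two outputs of one
    sub-module are the two inputs of the next."""
    def __init__(self, submodules):
        super().__init__()
        self.submodules = nn.ModuleList(submodules)

    def forward(self, f_rgb, f_ms):
        for block in self.submodules:
            f_rgb, f_ms = block(f_rgb, f_ms)
        return f_rgb, f_ms
```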
And step 103, obtaining target fusion characteristics according to the first fusion characteristics corresponding to the RGB image and the second fusion characteristics corresponding to the multispectral image, and constructing a target fusion image corresponding to the target region according to the target fusion characteristics.
In this embodiment, after the first fusion feature corresponding to the RGB image and the second fusion feature corresponding to the multispectral image are obtained, the first fusion feature and the second fusion feature may be added, so as to obtain the target fusion feature. Then, image reconstruction can be performed by utilizing the target fusion characteristics, and finally a target fusion image corresponding to the target area is obtained. The target fusion image comprises more and finer spatial features and spectral features, so that feature information in the RGB image and the multispectral image is mined more sufficiently, and meanwhile, the deep cross-mode fusion unit can naturally and simultaneously perform intra-mode and inter-mode fusion and capture potential interaction between the multispectral images, so that the fusion quality of the final target fusion image is remarkably improved.
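For illustration only, the addition and reconstruction just described might look as follows (a minimal sketch; the single-convolution reconstruction head and the channel counts are assumptions, not the application's actual reconstruction procedure):

```python
import torch
import torch.nn as nn

# Assumed reconstruction head: one 3x3 convolution mapping the 128-channel
# target fusion feature back to a 3-channel image-like tensor.
reconstruct = nn.Conv2d(128, 3, kernel_size=3, padding=1)

def build_target_fusion_image(first_fusion: torch.Tensor, second_fusion: torch.Tensor) -> torch.Tensor:
    target_fusion_feature = first_fusion + second_fusion   # element-wise addition of the two branches
    return reconstruct(target_fusion_feature)               # target fusion image for the target area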
And 104, inputting the target fusion image into an image detection model to obtain a detection result of the multispectral object.
In this embodiment, the obtained target fusion image may be directly input into the image detection model, and the target fusion image is detected by the image detection model, so as to determine the multispectral object included in the target region according to the result. The training sample of the image detection model may be a sample fusion image obtained in the above manner, that is, the sample fusion image is determined by the methods of the steps 101 to 103 according to the sample RGB image and the sample multispectral image corresponding to the preset area, then the sample multispectral object included in the sample fusion image is marked, and the initial model is trained according to the marked sample fusion image, so as to finally obtain the image detection model. According to the embodiment of the application, the target fusion image comprises abundant space information and spectrum information, so that the detection and identification of the multispectral object are carried out through the obtained target fusion image, the detection accuracy can be greatly improved, and the improvement of the multispectral object detection performance is facilitated.
By applying the technical scheme of the embodiment, firstly, an RGB image and a multispectral image aiming at the same target area can be obtained, and the two images are respectively subjected to feature extraction to obtain a first extraction feature and a second extraction feature. And then, performing feature processing on the first extracted features and the second extracted features by using a target fusion module, and finally outputting first fusion features corresponding to the RGB images and second fusion features corresponding to the multispectral images. Specifically, the target fusion module may include one fusion sub-module or may include a plurality of fusion sub-modules. Each fusion sub-module may include a joint self-attention unit, a deep cross-modality fusion unit, a first feature fusion unit, and a second feature fusion unit. The combined self-attention unit can process two paths of input features to obtain spectrum correlation features and space similarity features, and perform feature fusion on the spectrum correlation features and the space similarity features to obtain spectrum-space fusion features. The deep cross-modal fusion unit can perform cross-modal learning on the spectrum-space fusion characteristics to obtain cross-modal fusion characteristics, and perform inverse operation on the cross-modal fusion characteristics to obtain first supplementary characteristics and second supplementary characteristics. Then, the first feature fusion unit may perform an addition operation on the first extracted feature and the first supplementary feature to obtain a first fusion feature; and meanwhile, the second feature fusion unit can perform addition operation on the second extracted feature and the second supplementary feature to obtain a second fusion feature. After the first fusion feature and the second fusion feature are obtained, the first fusion feature and the second fusion feature can be added, so that the target fusion feature is obtained. Finally, image reconstruction can be performed by utilizing the target fusion characteristics, a target fusion image corresponding to the target region is finally obtained, and then the multispectral object is detected through the target fusion image. According to the embodiment of the application, the two paths of input features are further extracted through the combined self-attention unit in each fusion sub-module to obtain the spectrum correlation features and the space similarity features, and then the spectrum correlation features and the space similarity features are subjected to feature fusion to obtain spectrum-space fusion features, so that the primary fusion of the multi-source images is realized; the deep cross-modal fusion unit further carries out cross-modal learning on spectrum-space fusion characteristics, can naturally and simultaneously carry out intra-modal and inter-modal fusion, and deeply learns potential relations among multi-source images, so that characteristic information mined from two images is more sufficient; and adding the cross-modal fusion characteristics obtained by cross-modal learning as supplementary information to an original modal branch, and finally obtaining target fusion characteristics after subsequent operation, so that the fusion effect is greatly improved, and the detection performance of subsequent multispectral objects is improved.
Further, as a refinement and extension of the foregoing embodiment, in order to fully describe the implementation process of this embodiment, another image detection method based on multispectral image fusion is provided, as shown in fig. 2, where the method includes:
Step 201, an RGB image and a multispectral image corresponding to a target area are obtained, the RGB image is subjected to dimension increasing processing by at least one convolution layer to obtain the first extracted feature, and the multispectral image is subjected to dimension increasing processing by at least one convolution layer to obtain the second extracted feature, wherein the dimensions corresponding to the first extracted feature and the second extracted feature are the same.
In this embodiment, first, an RGB image and a multispectral image corresponding to the target area may be acquired, and the two images may first be subjected to registration processing. After registration, one or more convolution layers can be used to perform dimension-raising processing on the RGB image, so that the first extracted feature is obtained; similarly, one or more convolution layers can be used to perform dimension-raising processing on the multispectral image, so that the second extracted feature is obtained. For example, if the RGB image and the multispectral image both have a size of 640×640 and three convolution layers are used, each convolution layer raises the dimension of the image, and the number of channels after the three convolution layers may be raised to 128; the extracted features obtained in this way can capture more of the information in the image, that is, raising the dimension allows richer features to be learned. Here, the first extracted feature and the second extracted feature have the same dimensions so that the subsequent operations can be performed.
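For illustration, the dimension-raising step could be sketched as follows (assuming three 3×3 convolution layers and, for the multispectral branch, an 8-band input; the intermediate channel widths and the band count are assumptions beyond what the example above states):

```python
import torch
import torch.nn as nn

def make_extractor(in_channels: int, out_channels: int = 128) -> nn.Sequential:
    """Three convolution layers that progressively raise the channel dimension."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(64, out_channels, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    )

rgb_extractor = make_extractor(in_channels=3)   # RGB branch
ms_extractor = make_extractor(in_channels=8)    # multispectral branch (band count is an assumption)

rgb = torch.randn(1, 3, 640, 640)
ms = torch.randn(1, 8, 640, 640)
f1 = rgb_extractor(rgb)   # first extracted feature, 1 x 128 x 640 x 640
f2 = ms_extractor(ms)     # second extracted feature, same dimensions
```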
Step 202, inputting the first extracted feature and the second extracted feature to a target fusion module, and performing feature processing through the target fusion module to obtain a first fusion feature corresponding to the RGB image and a second fusion feature corresponding to the multispectral image, wherein the target fusion module comprises at least one fusion sub-module, and each fusion sub-module comprises a joint self-attention unit, a deep cross-modal fusion unit, a first feature fusion unit and a second feature fusion unit; the combined self-attention unit obtains spectrum correlation characteristics and space similarity characteristics according to two paths of input characteristics, and performs characteristic fusion on the spectrum correlation characteristics and the space similarity characteristics to obtain spectrum-space fusion characteristics; the deep cross-modal fusion unit performs cross-modal learning on the spectrum-space fusion feature to obtain a cross-modal fusion feature, and obtains a first supplementary feature of the first extraction feature and a second supplementary feature of the second extraction feature according to the cross-modal fusion feature; the first feature fusion unit obtains a first fusion feature according to the first extraction feature and the first supplementary feature; and the second feature fusion unit obtains a second fusion feature according to the second extraction feature and the second supplementary feature.
In this embodiment, when a plurality of fusion sub-modules are included in the target fusion module, the structure of the target fusion module may be as shown in fig. 3. As shown in fig. 3, the target fusion module includes p fusion sub-modules, these fusion sub-modules are sequentially connected, the output of the previous fusion sub-module is the input of the next fusion sub-module, each fusion sub-module performs feature update on two paths of input features, and the updated first fusion feature and second fusion feature output by the last fusion sub-module p can be used as the first fusion feature of the RGB image and the second fusion feature of the multispectral image. In addition, the structure of each fusion sub-module may be as shown in fig. 4.
Step 203, obtaining a target fusion feature according to the first fusion feature corresponding to the RGB image and the second fusion feature corresponding to the multispectral image, and constructing a target fusion image corresponding to the target region according to the target fusion feature.
And step 204, inputting the target fusion image into an image detection model to obtain a detection result of the multispectral object.
In an embodiment of the present application, optionally, the joint self-attention unit includes a first convolution unit, a spectral attention unit, a channel-space attention unit, and a second convolution unit; in step 202, "the combined self-attention unit obtains a spectral correlation feature and a spatial similarity feature according to two paths of input features, and performs feature fusion on the spectral correlation feature and the spatial similarity feature to obtain a spectral-spatial fusion feature", including: the first convolution unit performs feature fusion on the two paths of input features to obtain a shared input feature; the spectral attention unit extracts the spectral correlation feature according to a first path of input features and the shared input feature, wherein the first path of input features are the first extraction features or the first fusion features output by a previous fusion sub-module; the channel-space attention unit extracts the spatial similarity feature according to a second path of input features and the shared input feature, wherein the second path of input features are the second extraction features or the second fusion features output by a previous fusion sub-module; and the second convolution unit performs fusion processing on the spectral correlation feature and the spatial similarity feature to obtain the spectral-spatial fusion feature.
In this embodiment, the multispectral image itself exhibits self-similarity of texture in space, and the RGB image generally has higher spatial resolution with considerable detail and contrast. If the RGB image and the multispectral image are directly spliced and fed into a deep network, spatial detail and spectral information are easily lost. Starting from the intrinsic characteristics of the images, the application adopts a joint self-attention unit that makes full use of the spectral correlation and spatial self-similarity of the multi-source images. With the spectral attention unit in the joint self-attention unit, spectral correlation features can be extracted from the multispectral image; with the channel-space attention unit of the joint self-attention unit, spatial similarity features can be extracted from the RGB image; finally, the spectral correlation features and the spatial similarity features jointly guide the fusion of the multi-source images, so that the multispectral image and the RGB image are fused more effectively.
In the present application, as shown in Fig. 5, each joint self-attention unit may include the following parts: a first convolution unit, a spectral attention unit, a channel-space attention unit and a second convolution unit. The input of the first convolution unit is the concatenation result of the first path input feature F_y and the second path input feature F_z (the concatenation symbol in Fig. 5 represents the splice operation), and the first convolution unit convolves the splice result to obtain the shared input feature F_v. Here, when the joint self-attention unit belongs to the first fusion sub-module in the target fusion module, its first path input feature may be the first extracted feature and its second path input feature may be the second extracted feature; when the joint self-attention unit belongs to a non-first fusion sub-module, its first path input feature may be the first fusion feature output by the preceding adjacent fusion sub-module and its second path input feature may be the second fusion feature output by the preceding adjacent fusion sub-module. The spectral attention unit may then extract the spectral correlation feature from the shared input feature F_v output by the first convolution unit and the first path input feature F_y. Here, the first path input feature F_y used to extract the spectral correlation feature may be the first extracted feature or the first fusion feature output by the preceding adjacent fusion sub-module, depending on the position, within the target fusion module, of the fusion sub-module to which the current spectral attention unit belongs. Meanwhile, the channel-space attention unit may extract the spatial similarity feature from the shared input feature F_v output by the first convolution unit and the second path input feature F_z, where F_z may likewise be the second extracted feature or the second fusion feature output by the preceding adjacent fusion sub-module, depending on the position of the fusion sub-module to which the current channel-space attention unit belongs. After the spectral attention unit outputs the spectral correlation feature and the channel-space attention unit outputs the spatial similarity feature, the two features may be spliced, and the second convolution unit may perform a convolution operation on the spliced result, thereby outputting the spectral-spatial fusion feature. It should be noted that the numbers of convolution layers in the first convolution unit and the second convolution unit may be determined as required; for example, the first convolution unit may be a 3×3 convolution layer and the second convolution unit may be a 1×1 convolution layer. The convolution layers can perform dimension-raising operations, nonlinear operations and the like on the input features, so that the nonlinearity of the processed features is enhanced and higher-dimensional, deeper features can subsequently be learned.
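The composition of the joint self-attention unit can be sketched roughly as follows (an illustrative PyTorch sketch; the spectral attention and channel-space attention units are injected as modules and are detailed in later sections, and the kernel sizes follow the example in the text):

```python
import torch
import torch.nn as nn

class JointSelfAttention(nn.Module):
    def __init__(self, channels: int, spectral_attn: nn.Module, channel_spatial_attn: nn.Module):
        super().__init__()
        # first convolution unit: fuse the concatenated two-path input into the shared input feature F_v
        self.first_conv = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        self.spectral_attn = spectral_attn                 # takes (F_y, F_v)
        self.channel_spatial_attn = channel_spatial_attn   # takes (F_z, F_v)
        # second convolution unit: fuse the two attention outputs into the spectral-spatial fusion feature
        self.second_conv = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)

    def forward(self, f_y, f_z):
        f_v = self.first_conv(torch.cat([f_y, f_z], dim=1))            # shared input feature
        spectral = self.spectral_attn(f_y, f_v)                         # spectral correlation feature
        spatial = self.channel_spatial_attn(f_z, f_v)                   # spatial similarity feature
        return self.second_conv(torch.cat([spectral, spatial], dim=1))  # spectral-spatial fusion feature
```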
In an embodiment of the present application, optionally, the spectral attention unit includes a first processing subunit, a second processing subunit, and a third processing subunit; the spectrum attention unit extracts spectrum correlation characteristics according to the first path of input characteristics and the shared input characteristics, and the spectrum attention unit comprises: the first processing subunit performs convolution and matrix transformation processing on the first path of input features to obtain a spectrum query matrix and a spectrum key matrix, and obtains a spectrum attention matrix according to the spectrum query matrix and the spectrum key matrix; the second processing subunit performs convolution and matrix transformation processing on the shared input features to obtain a first value matrix; and the third processing subunit obtains the spectrum correlation characteristic according to the spectrum attention matrix and the first value matrix.
In this embodiment, as shown in Fig. 6, the spectral attention unit may include a first processing subunit, a second processing subunit and a third processing subunit. The first processing subunit may perform convolution and matrix transformation processing on the first path input feature F_y, thereby obtaining the corresponding spectral query matrix and spectral key matrix; it may then multiply the spectral query matrix with the spectral key matrix and obtain the spectral attention matrix based on the product. In addition, the second processing subunit may perform convolution and matrix transformation processing on the shared input feature F_v, thereby obtaining the first value matrix. As shown in Fig. 6, to enhance the spectral information characterization capability of the resulting spectral correlation feature, the first path input feature F_y from the multispectral branch first passes through the first processing subunit to obtain the spectral query matrix Q_spe and the spectral key matrix K_spe. Here, the first processing subunit may include two parallel convolution layers, specifically 1×1 convolution layers: one convolution layer processes the first path input feature F_y and the processing result is matrix-transformed to obtain the spectral query matrix Q_spe; the other convolution layer processes the first path input feature F_y and the processing result is matrix-transformed to obtain the spectral key matrix K_spe. The first processing subunit may then multiply the spectral query matrix Q_spe and the spectral key matrix K_spe by matrix multiplication and obtain the corresponding spectral attention matrix from the result. The preliminarily fused shared input feature F_v passes through the second processing subunit to obtain the first value matrix V_1; here, the second processing subunit may comprise a convolution layer, specifically a 1×1 convolution layer. The acquisition of the spectral query matrix Q_spe, the spectral key matrix K_spe and the first value matrix V_1 can be expressed as:

$$Q_{spe}=W_{Q}(F_{y}),\qquad K_{spe}=W_{K}(F_{y}),\qquad V_{1}=W_{V}(F_{v}),$$

where W_Q, W_K and W_V respectively denote the linear mappings of the 1×1 convolution layers (each followed by the matrix transformation).
The third processing subunit may then calculate the spectral attention matrix and the first value matrix by matrix multiplication to obtain the target attention matrix, where the target attention matrix characterizes the correlation information of the spectral dimension in the multispectral image. Subsequently, the target attention matrix can be matrix-transformed, the transformation result is input into a 1×1 convolution layer for linear mapping, and the spectral correlation feature is finally output. The process of obtaining the target attention matrix can be expressed as:

$$F_{att}=\operatorname{softmax}\!\left(\frac{Q_{spe}\,K_{spe}^{\top}}{\lambda}\right)V_{1},$$

where F_att denotes the target attention matrix and λ denotes a weight parameter that can be learned in the target fusion module. That is, when obtaining the spectral attention matrix from the product of the spectral query matrix Q_spe and the spectral key matrix K_spe, the multiplication result is divided by λ, and the result of the division is passed through a softmax activation function, finally giving the spectral attention matrix, i.e. the spectral attention matrix can be expressed as softmax(Q_spe K_spe^T / λ). Since the softmax activation function operation is generally omitted from such figures, it is not shown in Fig. 6.
According to the embodiment of the application, the spectrum correlation characteristic is obtained through the spectrum attention unit, so that the spectrum information can be effectively extracted from the multispectral image.
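A sketch of the spectral attention computation is given below (illustrative only; the tensor reshaping convention and the treatment of the learnable parameter λ as a division factor follow the formulas above, but the application's exact layout may differ):

```python
import torch
import torch.nn as nn

class SpectralAttention(nn.Module):
    """Spectral (channel-wise) self-attention: Q and K come from the first-path
    input F_y, the value matrix comes from the shared input F_v."""
    def __init__(self, channels: int):
        super().__init__()
        self.to_q = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_k = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_v = nn.Conv2d(channels, channels, kernel_size=1)
        self.project_out = nn.Conv2d(channels, channels, kernel_size=1)
        self.scale = nn.Parameter(torch.ones(1))   # learnable weight parameter (lambda)

    def forward(self, f_y, f_v):
        b, c, h, w = f_y.shape
        q = self.to_q(f_y).reshape(b, c, h * w)          # spectral query matrix
        k = self.to_k(f_y).reshape(b, c, h * w)          # spectral key matrix
        v = self.to_v(f_v).reshape(b, c, h * w)          # first value matrix
        attn = torch.softmax(q @ k.transpose(1, 2) / self.scale, dim=-1)  # spectral attention matrix (c x c)
        out = attn @ v                                    # target attention matrix
        return self.project_out(out.reshape(b, c, h, w)) # spectral correlation feature
```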
In an embodiment of the present application, optionally, the channel-space attention unit includes a channel attention subunit, a space attention subunit, a fourth processing subunit, and a fifth processing subunit; the channel-space attention unit extracts a spatial similarity feature from a second path input feature and the shared input feature, comprising: the channel attention subunit performs global average pooling operation on the second path of input features to obtain first average pooling features, performs global maximum pooling operation on the second path of input features to obtain first maximum pooling features, inputs the first average pooling features and the first maximum pooling features into a first multi-layer perceptron, obtains channel attention features according to an output result of the first multi-layer perceptron, and multiplies the second path of input features by the channel attention features to obtain channel enhancement features; the space attention subunit performs global average pooling operation on the channel enhancement features to obtain second average pooling features, performs global maximum pooling operation on the channel enhancement features to obtain second maximum pooling features, and performs convolution processing on the second average pooling features and the second maximum pooling features, so as to obtain a space attention map according to the convolution processing results, and obtains a space attention matrix according to the space attention map and the channel enhancement features; the fourth processing subunit performs convolution and matrix transformation processing on the shared input features to obtain a second value matrix; and the fifth processing subunit obtains the spatial similarity feature according to the spatial attention matrix and the second value matrix.
In this embodiment, as shown in Fig. 7, the channel-space attention unit may include a channel attention subunit, a space attention subunit, a fourth processing subunit and a fifth processing subunit. The channel attention subunit may perform a global average pooling operation on the second path input feature F_z and, at the same time, a global maximum pooling operation on F_z. The spatial information in F_z is aggregated by the global average pooling and global maximum pooling operations, thereby generating two descriptors of different spatial contexts: the first average pooling feature and the first maximum pooling feature. The channel attention subunit may then input the first average pooling feature and the first maximum pooling feature into the first multi-layer perceptron (MLP) and fuse the two outputs of the first multi-layer perceptron to obtain the channel attention feature M_c. Specifically, the derivation of the channel attention feature M_c in the channel attention subunit can be expressed as follows:

$$M_{c}=\sigma\big(W_{1}(W_{0}(\operatorname{AvgPool}(F_{z})))+W_{1}(W_{0}(\operatorname{MaxPool}(F_{z})))\big),$$

where σ is the Sigmoid activation function, AvgPool denotes the average pooling operation, MaxPool denotes the maximum pooling operation, and W_0 and W_1 denote the weights of the MLP.
The channel attention subunit may then multiply the second path input feature by the channel attention feature to obtain the channel enhancement feature. The spatial attention subunit may perform a global average pooling operation on the channel enhancement feature and, at the same time, a global maximum pooling operation on the channel enhancement feature; the global average pooling operation correspondingly yields the second average pooling feature, and the global maximum pooling operation correspondingly yields the second maximum pooling feature. The spatial attention subunit may then convolve the second average pooling feature and the second maximum pooling feature, weighting them through the convolution layer to generate an effective feature vector. The convolution result is followed by an activation function, through which the spatial attention map M_s is obtained. Specifically, the acquisition of the spatial attention map M_s can be expressed as follows:

$$M_{s}=\sigma\big(f^{7\times 7}([\operatorname{AvgPool}(F');\operatorname{MaxPool}(F')])\big),$$

where σ is the Sigmoid activation function, [·;·] represents the splicing of the pooled features, f^{7×7} represents a convolution operation with a 7×7 convolution kernel, and F' denotes the channel enhancement feature. The size of the convolution kernel can be determined according to requirements: the larger the convolution kernel, the larger the receptive field, which can improve the extraction of the spatial attention map; however, a larger convolution kernel also means a larger amount of computation, so an appropriate kernel size should be selected as needed.
The spatial attention subunit may further multiply the spatial attention map with the channel enhancement features to obtain a spatial attention matrix.
While the channel attention subunit and the spatial attention subunit are processing the second path input feature F_z, the fourth processing subunit may perform convolution and matrix transformation processing on the shared input feature F_v to obtain the second value matrix. After the spatial attention matrix and the second value matrix are obtained, the fifth processing subunit may multiply the spatial attention matrix by the second value matrix, matrix-transform the result, and output it after a 1×1 convolution, thereby obtaining the spatial similarity feature.
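The channel-space attention path might be sketched as follows (an illustrative PyTorch version; the reduction ratio is an assumption, and because the text does not fully specify how the spatial attention matrix and the second value matrix are multiplied, an element-wise combination is assumed here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelSpatialAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # channel attention subunit: shared MLP over avg- and max-pooled descriptors
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
        )
        # spatial attention subunit: 7x7 convolution over the pooled maps
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        # fourth processing subunit: second value matrix from the shared input feature
        self.to_v = nn.Conv2d(channels, channels, kernel_size=1)
        self.project_out = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, f_z, f_v):
        b, c, h, w = f_z.shape
        # channel attention feature M_c and channel enhancement feature
        avg = self.mlp(F.adaptive_avg_pool2d(f_z, 1))
        mx = self.mlp(F.adaptive_max_pool2d(f_z, 1))
        m_c = torch.sigmoid(avg + mx)
        enhanced = f_z * m_c
        # spatial attention map M_s and spatial attention matrix
        pooled = torch.cat([enhanced.mean(dim=1, keepdim=True),
                            enhanced.max(dim=1, keepdim=True).values], dim=1)
        m_s = torch.sigmoid(self.spatial_conv(pooled))
        spatial_attn = enhanced * m_s
        # combine with the second value matrix (element-wise combination is an assumption)
        v = self.to_v(f_v).reshape(b, c, h * w)
        out = (spatial_attn.reshape(b, c, h * w) * v).reshape(b, c, h, w)
        return self.project_out(out)                       # spatial similarity feature
```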
In the embodiment of the present application, optionally, in step 202, "the deep cross-modal fusion unit performs cross-modal learning on the spectrum-space fusion feature to obtain a cross-modal fusion feature, and obtains a first supplementary feature of the first extracted feature and a second supplementary feature of the second extracted feature according to the cross-modal fusion feature", including: splitting the spectrum-space fusion characteristic into a plurality of sub-fusion characteristics according to the number of channels, flattening each sub-fusion characteristic, and splicing the flattened sub-fusion characteristics to obtain a target flattening sequence corresponding to the spectrum-space fusion characteristic; adding the target flattening sequence and the corresponding position embedding matrix to obtain a first fusion matrix; carrying out normalization processing on the first fusion matrix, and respectively calculating the normalization processing result based on a multi-head attention mechanism to obtain a target query matrix, a target key matrix and a target value matrix corresponding to each attention mechanism; obtaining an attention matrix corresponding to each attention mechanism according to the target query matrix, the target key matrix and the target value matrix; splicing all the attention matrixes, and multiplying the spliced result by a projection matrix to obtain a second fusion matrix fused with multi-head attention information; obtaining a third fusion matrix according to the first fusion matrix and the second fusion matrix; normalizing the third fusion matrix, inputting the normalization result into a second multi-layer perceptron, and obtaining a cross-modal fusion characteristic according to the output result of the second multi-layer perceptron and the third fusion matrix; and performing inverse operation processing on the cross-modal fusion features to obtain the first supplementary features and the second supplementary features.
In this embodiment, the multispectral image and the RGB image are fused from the two dimensions of spectral information and channel-space information by the joint self-attention unit to obtain the spectral-spatial fusion feature, but this fusion is still shallow; therefore, in order to improve the quality of the fused feature, the deep cross-modal fusion unit can be used to further mine the spectral-spatial fusion feature. Convolutional neural networks lack the ability to explore long-range dependencies between features, whereas Transformer networks can effectively exploit the self-similarity of images and have advantages over convolution operations in modeling long-range dependencies. A Transformer network is a neural network that learns context, and thus meaning, by tracking relationships in sequence data. It employs a set of mathematical techniques called attention or self-attention to detect even subtle ways in which distant data elements in a sequence interact and depend on one another. The application uses a Transformer network to construct the deep cross-modal fusion unit, whose structure can be as shown in Fig. 8.
The deep cross-modal fusion unit may split the spectral-spatial fusion feature according to the number of channels to obtain a plurality of sub-fusion features, and may then flatten each sub-fusion feature to obtain a plurality of flattened sub-fusion features. For example, if the number of channels of the spectral-spatial fusion feature is a, the spectral-spatial fusion feature can be split into a sub-fusion features. Assuming each sub-fusion feature has m rows and k columns, flattening each sub-fusion feature means flattening it into 1 row and m×k columns; the a flattened sub-fusion features are then spliced in sequence, finally giving a target flattened sequence of 1 row and m×k×a columns. Here, the flattening may use a flatten operation. Thereafter, a learnable position embedding matrix, which is a trainable parameter, may be added; the position embedding matrix can be used to distinguish the spatial information between different channels during training and indicates the original position, within the spectral-spatial fusion feature, of each element of the target flattened sequence. The target flattened sequence and the position embedding matrix are added to obtain the first fusion matrix X_1.
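The channel-wise flattening and position embedding can be illustrated as follows (a minimal sketch; interpreting each channel as one token of the sequence is an assumption about the layout, and all names are illustrative):

```python
import torch
import torch.nn as nn

class ChannelFlatten(nn.Module):
    """Split the spectral-spatial fusion feature along the channel axis, flatten each
    channel, splice the flattened pieces, and add a learnable position embedding."""
    def __init__(self, channels: int, height: int, width: int):
        super().__init__()
        # one learnable position embedding per element of the target flattened sequence
        self.pos_embed = nn.Parameter(torch.zeros(1, channels, height * width))

    def forward(self, fused):                 # fused: (b, a, m, k) spectral-spatial fusion feature
        b, a, m, k = fused.shape
        seq = fused.reshape(b, a, m * k)      # each channel flattened to 1 x (m*k), spliced over a channels
        return seq + self.pos_embed           # first fusion matrix X_1
```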
Next, the first fusion matrix X_1 is normalized, and the normalized result I is projected onto the three weight matrices corresponding to each attention mechanism. Here, the three weight matrices of each attention mechanism in the multi-head attention mechanism are different, and through them different aspects of the first fusion matrix can be learned. Each attention mechanism computes one set of target query matrix Q, target key matrix K, and target value matrix V:
Q_i = I · W_i^Q,  K_i = I · W_i^K,  V_i = I · W_i^V

wherein W_i^Q, W_i^K and W_i^V are the weight matrices corresponding to the i-th attention mechanism. Each attention mechanism may calculate attention weights using the scaled dot product between Q and K, from which the attention matrix corresponding to that attention mechanism is derived. For the i-th attention mechanism, the resulting attention matrix may be denoted as Z_i and calculated as follows:

Z_i = Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / √d_k) V_i

wherein Attention represents the attention mechanism; K_i^T represents the transpose of the target key matrix K_i; d_k is the dimension of the target key matrix K; and √d_k is a scaling factor used to prevent the softmax function from falling into regions with extremely small gradients when the dot product becomes large. In order to encapsulate multiple complex relationships from different representation subspaces at different locations, a multi-head attention mechanism is employed. After the attention matrix corresponding to each attention mechanism in the multi-head attention mechanism is obtained, the attention matrices can be spliced, and the splicing result is multiplied by the projection matrix W, so as to obtain a second fusion matrix X_2 that fuses multi-head attention information:

X_2 = MultiHead(Q, K, V) = Concat(Z_1, Z_2, …, Z_n) W

wherein MultiHead denotes the multi-head attention mechanism, n denotes the number of attention heads in the multi-head attention mechanism, Concat denotes the splicing operation, and W denotes the projection matrix applied to Concat(Z_1, …, Z_n).
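As an illustration of the multi-head attention step, a minimal PyTorch-style sketch follows. Treating the input as a batch of token sequences, the head count, and the bias-free linear layers are assumptions for exposition, not part of the original disclosure.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Scaled dot-product multi-head self-attention over the normalized
    first fusion matrix, followed by the output projection W."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.d_k = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)   # W^Q, W^K, W^V for all heads
        self.proj = nn.Linear(dim, dim, bias=False)      # projection matrix W

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) -- the normalized first fusion matrix I
        b, s, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(t):
            # reshape to (batch, heads, seq_len, d_k)
            return t.view(b, s, self.num_heads, self.d_k).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        # scaled dot-product attention per head: softmax(Q K^T / sqrt(d_k)) V
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)
        z = attn @ v                                # (batch, heads, seq_len, d_k)
        z = z.transpose(1, 2).reshape(b, s, d)      # splice all heads
        return self.proj(z)                         # second fusion matrix X_2
```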
Thereafter, the first fusion matrix X_1 and the second fusion matrix X_2 may be added to obtain a third fusion matrix X_3. The third fusion matrix is then normalized, the normalization layer normalizing the features along the channel dimension. The normalization result is input into a second multi-layer perceptron, which consists of two fully connected layers with a GELU activation function between them. The output of the second multi-layer perceptron is added to the third fusion matrix to obtain the cross-modal fusion feature. The process of acquiring the cross-modal fusion feature O may be expressed as follows (note that the object actually processed by the MLP is the normalized third fusion matrix X_3):

X_3 = X_1 + X_2
O = MLP(LN(X_3)) + X_3
MLP(x) = FC_2(GELU(FC_1(x)))

wherein MLP (Multilayer Perceptron) represents the multi-layer perceptron, LN represents the normalization layer acting on the channel dimension, FC (Fully Connected Layer) represents a fully connected layer, FC_1 represents the first fully connected layer in the second multi-layer perceptron, and FC_2 represents the second fully connected layer in the second multi-layer perceptron. Finally, the cross-modal fusion feature O is adjusted by the inverse of the flattening operation in the first step and output as the first supplementary feature and the second supplementary feature, which are added as supplementary information to the original modality branches.
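The residual structure above can be sketched as a pre-norm Transformer encoder block. This is an illustrative reading only; the MLP expansion ratio and the reuse of the MultiHeadSelfAttention sketch from the previous example are assumptions, not the original disclosure.

```python
import torch
import torch.nn as nn

class DeepCrossModalBlock(nn.Module):
    """Pre-norm multi-head attention with a residual connection, followed by a
    pre-norm two-layer GELU MLP with a residual connection."""

    def __init__(self, dim: int, num_heads: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = MultiHeadSelfAttention(dim, num_heads)  # from the sketch above
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(                            # FC_1 -> GELU -> FC_2
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x1: torch.Tensor) -> torch.Tensor:
        # x1: first fusion matrix (flattened sequence plus position embedding)
        x3 = x1 + self.attn(self.norm1(x1))   # X_3 = X_1 + X_2
        o = x3 + self.mlp(self.norm2(x3))     # O = MLP(LN(X_3)) + X_3
        return o                              # cross-modal fusion feature O
```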
When the cross-modal fusion feature O is adjusted into the first supplementary feature and the second supplementary feature, since the cross-modal fusion feature O is a sequence of the same size as the target flattening sequence, it can first be restored to a matrix of the same size as the spectrum-space fusion feature by the inverse of the process used to obtain the target flattening sequence. The matrix is then split into two sub-matrices, each of the same size as the spectral correlation feature and the spatial similarity feature; these two sub-matrices are the first supplementary feature and the second supplementary feature. Because the spectrum-space fusion feature is obtained by splicing the spectral correlation feature and the spatial similarity feature, if the spectral correlation feature and the spatial similarity feature are each of size H×W×C, the spliced spectrum-space fusion feature is of size 2×H×W×C. Therefore, when the cross-modal fusion feature O is restored to the size of the spectrum-space fusion feature, the resulting matrix is of size 2×H×W×C; this matrix is then split according to the splicing order of the spectral correlation feature and the spatial similarity feature, giving the first supplementary feature and the second supplementary feature.
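A minimal sketch of this inverse operation, assuming the splicing of the two features is along the channel dimension and the flattening order matches the earlier split-flatten sketch:

```python
import torch

def split_supplementary_features(o: torch.Tensor, h: int, w: int, c: int):
    """Restore the cross-modal fusion feature O (a flat sequence) to the size of
    the spectrum-space fusion feature and split it into the two supplementary
    features according to the splicing order."""
    # o: (1, 2*c*h*w) sequence, same size as the target flattening sequence
    restored = o.reshape(2 * c, h, w)       # inverse of the channel-wise flattening
    first_supplementary = restored[:c]      # same size as the spectral correlation feature
    second_supplementary = restored[c:]     # same size as the spatial similarity feature
    return first_supplementary, second_supplementary
```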
Further, the present application provides another image detection method based on multispectral image fusion, as shown in fig. 9, the method includes:
First, an RGB image and a multispectral image corresponding to a target area may be acquired, and the RGB image and the multispectral image may first be subjected to calibration processing. After the calibration processing, one or more convolution layers (h in the figure denotes the total number of convolution layers) may be used to perform dimension-lifting processing on the RGB image, thereby obtaining the first extracted feature. Similarly, one or more convolution layers may be used to perform dimension-lifting processing on the multispectral image, thereby obtaining the second extracted feature. Here, the first extracted feature and the second extracted feature have the same dimensions so that the subsequent operations can be performed. After the first extracted feature and the second extracted feature are obtained, the target fusion module may be used to perform feature processing on them, finally outputting the first fusion feature corresponding to the RGB image and the second fusion feature corresponding to the multispectral image. After the first fusion feature corresponding to the RGB image and the second fusion feature corresponding to the multispectral image are obtained, the two can be added to obtain the target fusion feature. Image reconstruction can then be performed using the target fusion feature, finally obtaining the target fusion image corresponding to the target area. The obtained target fusion image can be directly input into an image detection model, the image detection model detects the target fusion image, and the multispectral objects included in the target area are determined according to the detection result.
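By way of illustration only, an end-to-end sketch of this pipeline is given below. The channel counts, the use of a single convolution layer per branch, and the reconstruction head are assumptions for exposition; the disclosure allows one or more convolution layers and does not fix these values.

```python
import torch
import torch.nn as nn

class MultispectralFusionDetector(nn.Module):
    """Lift both inputs with convolution layers, fuse them with a target fusion
    module, add the two fused branches, reconstruct a fused image, and hand it
    to a detector. The fusion module and detector are placeholders."""

    def __init__(self, fusion_module: nn.Module, detector: nn.Module,
                 rgb_channels: int = 3, ms_channels: int = 8, dim: int = 64):
        super().__init__()
        self.rgb_lift = nn.Conv2d(rgb_channels, dim, kernel_size=3, padding=1)
        self.ms_lift = nn.Conv2d(ms_channels, dim, kernel_size=3, padding=1)
        self.fusion_module = fusion_module              # target fusion module
        self.reconstruct = nn.Conv2d(dim, 3, kernel_size=3, padding=1)
        self.detector = detector                        # image detection model

    def forward(self, rgb: torch.Tensor, ms: torch.Tensor):
        f1 = self.rgb_lift(rgb)                         # first extracted feature
        f2 = self.ms_lift(ms)                           # second extracted feature
        fused1, fused2 = self.fusion_module(f1, f2)     # first/second fusion features
        target_fusion = fused1 + fused2                 # target fusion feature
        fused_image = self.reconstruct(target_fusion)   # target fusion image
        return self.detector(fused_image)               # detection result
```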
Further, as a specific implementation of the method of fig. 1, an embodiment of the present application provides an image detection apparatus based on multispectral image fusion, as shown in fig. 10, where the apparatus includes:
The image acquisition module is used for acquiring an RGB image and a multispectral image corresponding to the target area, and respectively extracting the characteristics of the RGB image and the multispectral image to obtain a first extracted characteristic and a second extracted characteristic;
The feature processing module is used for inputting the first extracted features and the second extracted features to a target fusion module, performing feature processing through the target fusion module to obtain first fusion features corresponding to the RGB images and second fusion features corresponding to the multispectral images, wherein the target fusion module comprises at least one fusion sub-module, and each fusion sub-module comprises a joint self-attention unit, a deep cross-modal fusion unit, a first feature fusion unit and a second feature fusion unit; the combined self-attention unit obtains spectrum correlation characteristics and space similarity characteristics according to two paths of input characteristics, and performs characteristic fusion on the spectrum correlation characteristics and the space similarity characteristics to obtain spectrum-space fusion characteristics; the deep cross-modal fusion unit performs cross-modal learning on the spectrum-space fusion feature to obtain a cross-modal fusion feature, and obtains a first supplementary feature of the first extraction feature and a second supplementary feature of the second extraction feature according to the cross-modal fusion feature; the first feature fusion unit obtains a first fusion feature according to the first extraction feature and the first supplementary feature; the second feature fusion unit obtains a second fusion feature according to the second extraction feature and the second supplementary feature;
The image construction module is used for obtaining target fusion characteristics according to the first fusion characteristics corresponding to the RGB image and the second fusion characteristics corresponding to the multispectral image, and constructing a target fusion image corresponding to the target area according to the target fusion characteristics;
and the detection module is used for inputting the target fusion image into an image detection model to obtain a detection result of the multispectral object.
Optionally, when the target fusion module includes a plurality of fusion sub-modules, each of the fusion sub-modules is connected in sequence, and the output of the previous fusion sub-module is the input of the next adjacent fusion sub-module; when any fusion sub-module is the first fusion sub-module in the target fusion module, the two paths of input features of any fusion sub-module are the first extraction feature and the second extraction feature respectively, otherwise, the two paths of input features of any fusion sub-module are the first fusion feature and the second fusion feature output by the previous adjacent fusion sub-module respectively.
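A minimal sketch of this chaining of fusion sub-modules follows; the class and variable names are illustrative assumptions.

```python
import torch.nn as nn

class TargetFusionModule(nn.Module):
    """Chain several fusion sub-modules; the two outputs of one sub-module
    become the two inputs of the next adjacent sub-module."""

    def __init__(self, submodules):
        super().__init__()
        self.submodules = nn.ModuleList(submodules)   # each maps (x1, x2) -> (y1, y2)

    def forward(self, first_extracted, second_extracted):
        x1, x2 = first_extracted, second_extracted    # inputs of the first sub-module
        for sub in self.submodules:
            x1, x2 = sub(x1, x2)                      # first/second fusion features
        return x1, x2
```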
Optionally, the joint self-attention unit comprises a first convolution unit, a spectral attention unit, a channel-space attention unit, and a second convolution unit;
the first convolution unit is used for carrying out feature fusion on the two paths of input features to obtain shared input features;
The spectrum attention unit is used for extracting spectrum correlation characteristics according to a first path of input characteristics and the shared input characteristics, wherein the first path of input characteristics are the first extraction characteristics or the first fusion characteristics output by a previous fusion submodule;
The channel-space attention unit is used for extracting space similarity features according to a second path of input features and the shared input features, wherein the second path of input features are second fusion features output by the second extraction features or a previous fusion submodule;
and the second convolution unit is used for carrying out fusion processing on the spectrum correlation characteristic and the space similarity characteristic to obtain the spectrum-space fusion characteristic.
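The structure described above can be sketched as follows. The convolution configurations (kernel sizes, channel counts) and the concatenation-based fusion are assumptions for exposition; the disclosure only states that the two convolution units perform feature fusion.

```python
import torch
import torch.nn as nn

class JointSelfAttentionUnit(nn.Module):
    """First convolution unit builds a shared input feature from both branches;
    a spectral attention unit and a channel-space attention unit produce the two
    attention features; a second convolution unit fuses them into the
    spectrum-space fusion feature."""

    def __init__(self, dim: int, spectral_attention: nn.Module,
                 channel_space_attention: nn.Module):
        super().__init__()
        self.first_conv = nn.Conv2d(2 * dim, dim, kernel_size=1)      # shared input feature
        self.spectral_attention = spectral_attention                   # spectral attention unit
        self.channel_space_attention = channel_space_attention         # channel-space attention unit
        self.second_conv = nn.Conv2d(2 * dim, 2 * dim, kernel_size=3, padding=1)

    def forward(self, first_input: torch.Tensor, second_input: torch.Tensor):
        shared = self.first_conv(torch.cat([first_input, second_input], dim=1))
        spectral_corr = self.spectral_attention(first_input, shared)       # spectral correlation feature
        spatial_sim = self.channel_space_attention(second_input, shared)   # spatial similarity feature
        # splice the two features and fuse them with the second convolution unit
        return self.second_conv(torch.cat([spectral_corr, spatial_sim], dim=1))
```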
Optionally, the spectral attention unit comprises a first processing subunit, a second processing subunit and a third processing subunit;
The first processing subunit is configured to perform convolution and matrix transformation processing on the first path of input features to obtain a spectrum query matrix and a spectrum key matrix, and obtain a spectrum attention matrix according to the spectrum query matrix and the spectrum key matrix;
the second processing subunit is used for carrying out convolution and matrix transformation processing on the shared input features to obtain a first value matrix;
And the third processing subunit is configured to obtain the spectral correlation feature according to the spectral attention matrix and the first value matrix.
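A possible sketch of the spectral attention unit is given below. The use of 1×1 convolutions, the channel-wise attention arrangement, and the softmax are assumptions; the disclosure only specifies convolution plus matrix transformation and the query/key/value roles.

```python
import torch
import torch.nn as nn

class SpectralAttentionUnit(nn.Module):
    """Derive a spectral query and key from the first branch, a value from the
    shared input feature, and weight the value by the spectral attention matrix
    to obtain the spectral correlation feature."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_query = nn.Conv2d(dim, dim, kernel_size=1)
        self.to_key = nn.Conv2d(dim, dim, kernel_size=1)
        self.to_value = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, first_input: torch.Tensor, shared: torch.Tensor) -> torch.Tensor:
        b, c, h, w = first_input.shape
        # convolution + matrix transformation: reshape to (batch, channels, h*w)
        q = self.to_query(first_input).reshape(b, c, h * w)   # spectral query matrix
        k = self.to_key(first_input).reshape(b, c, h * w)     # spectral key matrix
        v = self.to_value(shared).reshape(b, c, h * w)        # first value matrix
        attn = torch.softmax(q @ k.transpose(1, 2), dim=-1)   # spectral attention matrix (c x c)
        out = attn @ v                                         # weight the value matrix
        return out.reshape(b, c, h, w)                         # spectral correlation feature
```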
Optionally, the channel-space attention unit includes a channel attention subunit, a space attention subunit, a fourth processing subunit, and a fifth processing subunit;
The channel attention subunit is configured to perform global average pooling operation on the second input feature to obtain a first average pooling feature, and perform global maximum pooling operation on the second input feature to obtain a first maximum pooling feature, input the first average pooling feature and the first maximum pooling feature into a first multi-layer perceptron, obtain a channel attention feature according to an output result of the first multi-layer perceptron, and multiply the second input feature with the channel attention feature to obtain a channel enhancement feature;
The spatial attention subunit is configured to perform global average pooling operation on the channel enhancement feature to obtain a second average pooling feature, and perform global maximum pooling operation on the channel enhancement feature to obtain a second maximum pooling feature, perform convolution processing on the second average pooling feature and the second maximum pooling feature, obtain a spatial attention map according to a convolution processing result, and obtain a spatial attention matrix according to the spatial attention map and the channel enhancement feature;
the fourth processing subunit is configured to perform convolution and matrix transformation processing on the shared input feature to obtain a second value matrix;
The fifth processing subunit is configured to obtain the spatial similarity feature according to the spatial attention matrix and the second value matrix.
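A possible sketch of the channel-space attention unit follows. The pooling axes, the sigmoid gating, the MLP reduction ratio, and the element-wise combination of the spatial attention matrix with the second value matrix are assumptions made for exposition; the disclosure does not fix these details.

```python
import torch
import torch.nn as nn

class ChannelSpaceAttentionUnit(nn.Module):
    """Channel attention (global average/max pooling + shared MLP), then spatial
    attention (pooled maps + convolution) on the second branch; the shared input
    feature supplies the value weighted by the spatial attention matrix."""

    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(                      # first multi-layer perceptron
            nn.Linear(dim, dim // reduction),
            nn.ReLU(),
            nn.Linear(dim // reduction, dim),
        )
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        self.to_value = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, second_input: torch.Tensor, shared: torch.Tensor) -> torch.Tensor:
        b, c, h, w = second_input.shape
        # channel attention: global average and max pooling, shared MLP, sigmoid gate
        avg = second_input.mean(dim=(2, 3))            # first average pooling feature
        mx = second_input.amax(dim=(2, 3))             # first maximum pooling feature
        channel_attn = torch.sigmoid(self.mlp(avg) + self.mlp(mx)).view(b, c, 1, 1)
        enhanced = second_input * channel_attn         # channel enhancement feature
        # spatial attention: pool across channels, convolve, sigmoid gate
        pooled = torch.cat([enhanced.mean(dim=1, keepdim=True),
                            enhanced.amax(dim=1, keepdim=True)], dim=1)
        spatial_map = torch.sigmoid(self.spatial_conv(pooled))   # spatial attention map
        spatial_attn = enhanced * spatial_map                    # spatial attention matrix
        value = self.to_value(shared)                            # second value matrix
        return spatial_attn * value                              # spatial similarity feature
```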
Optionally, the deep cross-modal fusion unit is configured to:
splitting the spectrum-space fusion characteristic into a plurality of sub-fusion characteristics according to the number of channels, flattening each sub-fusion characteristic, and splicing the flattened sub-fusion characteristics to obtain a target flattening sequence corresponding to the spectrum-space fusion characteristic;
Adding the target flattening sequence and the corresponding position embedding matrix to obtain a first fusion matrix;
Carrying out normalization processing on the first fusion matrix, and respectively calculating the normalization processing result based on a multi-head attention mechanism to obtain a target query matrix, a target key matrix and a target value matrix corresponding to each attention mechanism;
Obtaining an attention matrix corresponding to each attention mechanism according to the target query matrix, the target key matrix and the target value matrix;
Splicing all the attention matrixes, and multiplying the spliced result by a projection matrix to obtain a second fusion matrix fused with multi-head attention information;
obtaining a third fusion matrix according to the first fusion matrix and the second fusion matrix;
normalizing the third fusion matrix, inputting the normalization result into a second multi-layer perceptron, and obtaining a cross-modal fusion characteristic according to the output result of the second multi-layer perceptron and the third fusion matrix;
and performing inverse operation processing on the cross-modal fusion features to obtain the first supplementary features and the second supplementary features.
Optionally, the apparatus further comprises:
the dimension increasing module is used for carrying out dimension increasing processing on the RGB image by utilizing at least one layer of convolution layer to obtain the first extracted feature, and carrying out dimension increasing processing on the multispectral image by utilizing at least one layer of convolution layer to obtain the second extracted feature, wherein the dimensions corresponding to the first extracted feature and the second extracted feature are the same.
It should be noted that, other corresponding descriptions of each functional unit related to the image detection device based on multispectral image fusion provided by the embodiment of the present application may refer to corresponding descriptions in the methods of fig. 1 to 9, and are not repeated here.
The embodiment of the application also provides a computer device, which can be a personal computer, a server, a network device and the like, and as shown in fig. 11, the computer device comprises a bus, a processor, a memory and a communication interface, and can also comprise an input/output interface and a display device. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is for storing location information. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement the steps in the method embodiments.
It will be appreciated by those skilled in the art that the structure shown in FIG. 11 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one embodiment, a computer readable storage medium is provided, which may be non-volatile or volatile, and on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the method embodiments described above.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
The user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or sufficiently authorized by each party.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory can include Random Access Memory (RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in various forms such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), etc. The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.

Claims (10)

1. An image detection method based on multispectral image fusion is characterized by comprising the following steps:
Acquiring an RGB image and a multispectral image corresponding to a target area, and respectively carrying out feature extraction on the RGB image and the multispectral image to obtain a first extraction feature and a second extraction feature;
Inputting the first extracted features and the second extracted features into a target fusion module, and performing feature processing through the target fusion module to obtain first fusion features corresponding to the RGB images and second fusion features corresponding to the multispectral images, wherein the target fusion module comprises at least one fusion sub-module, and each fusion sub-module comprises a joint self-attention unit, a deep cross-mode fusion unit, a first feature fusion unit and a second feature fusion unit; the combined self-attention unit obtains spectrum correlation characteristics and space similarity characteristics according to two paths of input characteristics, and performs characteristic fusion on the spectrum correlation characteristics and the space similarity characteristics to obtain spectrum-space fusion characteristics; the deep cross-modal fusion unit performs cross-modal learning on the spectrum-space fusion feature to obtain a cross-modal fusion feature, and obtains a first supplementary feature of the first extraction feature and a second supplementary feature of the second extraction feature according to the cross-modal fusion feature; the first feature fusion unit obtains a first fusion feature according to the first extraction feature and the first supplementary feature; the second feature fusion unit obtains a second fusion feature according to the second extraction feature and the second supplementary feature;
Obtaining target fusion characteristics according to the first fusion characteristics corresponding to the RGB image and the second fusion characteristics corresponding to the multispectral image, and constructing a target fusion image corresponding to the target area according to the target fusion characteristics;
And inputting the target fusion image into an image detection model to obtain a detection result of the multispectral object.
2. The method of claim 1, wherein when the target fusion module includes a plurality of fusion sub-modules, each of the fusion sub-modules is connected in sequence, and an output of a previous fusion sub-module is an input of a next adjacent fusion sub-module; when any fusion sub-module is the first fusion sub-module in the target fusion module, the two paths of input features of any fusion sub-module are the first extraction feature and the second extraction feature respectively, otherwise, the two paths of input features of any fusion sub-module are the first fusion feature and the second fusion feature output by the previous adjacent fusion sub-module respectively.
3. The method according to claim 1 or 2, wherein the joint self-attention unit comprises a first convolution unit, a spectral attention unit, a channel-space attention unit, and a second convolution unit; the combined self-attention unit obtains spectrum correlation characteristics and space similarity characteristics according to two paths of input characteristics, and performs characteristic fusion on the spectrum correlation characteristics and the space similarity characteristics to obtain spectrum-space fusion characteristics, and the combined self-attention unit comprises:
The first convolution unit performs feature fusion on the two paths of input features to obtain shared input features;
The spectrum attention unit extracts spectrum correlation characteristics according to a first path of input characteristics and the shared input characteristics, wherein the first path of input characteristics are the first extraction characteristics or first fusion characteristics output by a previous fusion submodule;
the channel-space attention unit extracts space similarity features according to a second path of input features and the shared input features, wherein the second path of input features are second fusion features output by the second extraction features or a previous fusion submodule;
and the second convolution unit performs fusion processing on the spectrum correlation characteristic and the space similarity characteristic to obtain the spectrum-space fusion characteristic.
4. A method according to claim 3, wherein the spectral attention unit comprises a first processing subunit, a second processing subunit, and a third processing subunit; the spectrum attention unit extracts spectrum correlation characteristics according to the first path of input characteristics and the shared input characteristics, and the spectrum attention unit comprises:
The first processing subunit performs convolution and matrix transformation processing on the first path of input features to obtain a spectrum query matrix and a spectrum key matrix, and obtains a spectrum attention matrix according to the spectrum query matrix and the spectrum key matrix;
the second processing subunit performs convolution and matrix transformation processing on the shared input features to obtain a first value matrix;
And the third processing subunit obtains the spectrum correlation characteristic according to the spectrum attention matrix and the first value matrix.
5. A method according to claim 3, wherein the channel-space attention unit comprises a channel attention subunit, a space attention subunit, a fourth processing subunit, and a fifth processing subunit; the channel-space attention unit extracts a spatial similarity feature from a second path input feature and the shared input feature, comprising:
The channel attention subunit performs global average pooling operation on the second path of input features to obtain first average pooling features, performs global maximum pooling operation on the second path of input features to obtain first maximum pooling features, inputs the first average pooling features and the first maximum pooling features into a first multi-layer perceptron, obtains channel attention features according to an output result of the first multi-layer perceptron, and multiplies the second path of input features by the channel attention features to obtain channel enhancement features;
The space attention subunit performs global average pooling operation on the channel enhancement features to obtain second average pooling features, and performs global maximum pooling operation on the channel enhancement features to obtain second maximum pooling features, and performs convolution processing on the second average pooling features and the second maximum pooling features, so as to obtain space attention map according to convolution processing results, and obtain a space attention matrix according to the space attention map and the channel enhancement features;
The fourth processing subunit performs convolution and matrix transformation processing on the shared input features to obtain a second value matrix;
And the fifth processing subunit obtains the spatial similarity feature according to the spatial attention matrix and the second value matrix.
6. The method according to claim 1 or 2, wherein the deep cross-modal fusion unit performs cross-modal learning on the spectrum-space fusion feature to obtain a cross-modal fusion feature, and obtains a first supplementary feature of the first extracted feature and a second supplementary feature of the second extracted feature according to the cross-modal fusion feature, including:
splitting the spectrum-space fusion characteristic into a plurality of sub-fusion characteristics according to the number of channels, flattening each sub-fusion characteristic, and splicing the flattened sub-fusion characteristics to obtain a target flattening sequence corresponding to the spectrum-space fusion characteristic;
Adding the target flattening sequence and the corresponding position embedding matrix to obtain a first fusion matrix;
Carrying out normalization processing on the first fusion matrix, and respectively calculating the normalization processing result based on a multi-head attention mechanism to obtain a target query matrix, a target key matrix and a target value matrix corresponding to each attention mechanism;
Obtaining an attention matrix corresponding to each attention mechanism according to the target query matrix, the target key matrix and the target value matrix;
Splicing all the attention matrixes, and multiplying the spliced result by a projection matrix to obtain a second fusion matrix fused with multi-head attention information;
obtaining a third fusion matrix according to the first fusion matrix and the second fusion matrix;
normalizing the third fusion matrix, inputting the normalization result into a second multi-layer perceptron, and obtaining a cross-modal fusion characteristic according to the output result of the second multi-layer perceptron and the third fusion matrix;
and performing inverse operation processing on the cross-modal fusion features to obtain the first supplementary features and the second supplementary features.
7. The method according to claim 1, wherein the feature extracting the RGB image and the multispectral image to obtain a first extracted feature and a second extracted feature includes:
And carrying out dimension lifting processing on the RGB image by using at least one convolution layer to obtain the first extraction feature, and carrying out dimension lifting processing on the multispectral image by using at least one convolution layer to obtain the second extraction feature, wherein the dimensions of the first extraction feature and the dimension of the second extraction feature are the same.
8. An image detection device based on multispectral image fusion, comprising:
The image acquisition module is used for acquiring an RGB image and a multispectral image corresponding to the target area, and respectively extracting the characteristics of the RGB image and the multispectral image to obtain a first extracted characteristic and a second extracted characteristic;
The feature processing module is used for inputting the first extracted features and the second extracted features to a target fusion module, performing feature processing through the target fusion module to obtain first fusion features corresponding to the RGB images and second fusion features corresponding to the multispectral images, wherein the target fusion module comprises at least one fusion sub-module, and each fusion sub-module comprises a joint self-attention unit, a deep cross-modal fusion unit, a first feature fusion unit and a second feature fusion unit; the combined self-attention unit obtains spectrum correlation characteristics and space similarity characteristics according to two paths of input characteristics, and performs characteristic fusion on the spectrum correlation characteristics and the space similarity characteristics to obtain spectrum-space fusion characteristics; the deep cross-modal fusion unit performs cross-modal learning on the spectrum-space fusion feature to obtain a cross-modal fusion feature, and obtains a first supplementary feature of the first extraction feature and a second supplementary feature of the second extraction feature according to the cross-modal fusion feature; the first feature fusion unit obtains a first fusion feature according to the first extraction feature and the first supplementary feature; the second feature fusion unit obtains a second fusion feature according to the second extraction feature and the second supplementary feature;
The image construction module is used for obtaining target fusion characteristics according to the first fusion characteristics corresponding to the RGB image and the second fusion characteristics corresponding to the multispectral image, and constructing a target fusion image corresponding to the target area according to the target fusion characteristics;
and the detection module is used for inputting the target fusion image into an image detection model to obtain a detection result of the multispectral object.
9. A storage medium having stored thereon a computer program, which when executed by a processor, implements the method of any of claims 1 to 7.
10. A computer device comprising a storage medium, a processor and a computer program stored on the storage medium and executable on the processor, characterized in that the processor implements the method of any one of claims 1 to 7 when executing the computer program.
CN202410564425.9A 2024-05-09 2024-05-09 Image detection method and device based on multispectral image fusion, medium and equipment Active CN118155036B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410564425.9A CN118155036B (en) 2024-05-09 2024-05-09 Image detection method and device based on multispectral image fusion, medium and equipment

Publications (2)

Publication Number Publication Date
CN118155036A true CN118155036A (en) 2024-06-07
CN118155036B CN118155036B (en) 2024-07-23

Family

ID=91300205

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410564425.9A Active CN118155036B (en) 2024-05-09 2024-05-09 Image detection method and device based on multispectral image fusion, medium and equipment

Country Status (1)

Country Link
CN (1) CN118155036B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110279303A1 (en) * 2010-05-13 2011-11-17 The United States Of America As Represented By The Secretary Of The Navy Active-radar-assisted passive composite imagery for aiding navigation or detecting threats
CN108830330A (en) * 2018-06-22 2018-11-16 西安电子科技大学 Classification of Multispectral Images method based on self-adaptive features fusion residual error net
CN109871830A (en) * 2019-03-15 2019-06-11 中国人民解放军国防科技大学 Spatial-spectral fusion hyperspectral image classification method based on three-dimensional depth residual error network
CN110378295A (en) * 2019-07-22 2019-10-25 安徽理工大学 A kind of coal rock detection method of multispectral image information and the fusion of spectral information isomery

Also Published As

Publication number Publication date
CN118155036B (en) 2024-07-23

Similar Documents

Publication Publication Date Title
CN112287978B (en) Hyperspectral remote sensing image classification method based on self-attention context network
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN105138973B (en) The method and apparatus of face authentication
CN105678284B (en) A kind of fixed bit human body behavior analysis method
CN112668648B (en) Infrared and visible light fusion recognition method based on symmetrical fusion network
CN111583165A (en) Image processing method, device, equipment and storage medium
CN113486851B (en) Hyperspectral image classification method based on double-branch spectrum multi-scale attention network
Wang et al. Nonlocal self-similarity-based hyperspectral remote sensing image denoising with 3-D convolutional neural network
CN111369487A (en) Hyperspectral and multispectral image fusion method, system and medium
CN110245683B (en) Residual error relation network construction method for less-sample target identification and application
CN116580257A (en) Feature fusion model training and sample retrieval method and device and computer equipment
CN112767267B (en) Image defogging method based on simulation polarization fog-carrying scene data set
CN114549913A (en) Semantic segmentation method and device, computer equipment and storage medium
CN114742733A (en) Cloud removing method and device, computer equipment and storage medium
Moghimi et al. Tensor-based keypoint detection and switching regression model for relative radiometric normalization of bitemporal multispectral images
CN113705788A (en) Infrared image temperature estimation method and system based on full convolution neural network
CN114419406A (en) Image change detection method, training method, device and computer equipment
CN115187861A (en) Hyperspectral image change detection method and system based on depth twin network
CN114663777B (en) Hyperspectral image change detection method based on space-time joint graph attention mechanism
CN116012841A (en) Open set image scene matching method and device based on deep learning
CN116563606A (en) Hyperspectral image classification method based on dual-branch spatial spectrum global feature extraction network
CN115797781A (en) Crop identification method and device, computer equipment and storage medium
CN117197008A (en) Remote sensing image fusion method and system based on fusion correction
Liu et al. Hyperspectral remote sensing imagery generation from RGB images based on joint discrimination
CN112183602A (en) Multi-layer feature fusion fine-grained image classification method with parallel rolling blocks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant