CN116310839A - Remote sensing image building change detection method based on feature enhancement network - Google Patents

Remote sensing image building change detection method based on feature enhancement network

Info

Publication number
CN116310839A
Authority
CN
China
Prior art keywords
feature
building
input
channel
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310426990.4A
Other languages
Chinese (zh)
Inventor
韩现伟 (Han Xianwei)
孙宇 (Sun Yu)
张一民 (Zhang Yimin)
高伟 (Gao Wei)
杨光辉 (Yang Guanghui)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan University
Original Assignee
Henan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan University
Priority to CN202310426990.4A
Publication of CN116310839A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G06V20/176 Urban or other man-made structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10032 Satellite or aerial image; Remote sensing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20212 Image combination
    • G06T2207/20221 Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Quality & Reliability (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a remote sensing image building change detection method based on a feature enhancement network, which comprises the following steps: the first step: preparing a data set; the second step: performing data enhancement; the third step: building and training a network model; the fourth step: detecting building changes. By introducing a visual transformer structure, spatial and channel attention, a U-shaped residual module, an enhanced feature extraction module, and a self-attention feature fusion module, the invention fully fuses information from different buildings, better distinguishes buildings of different shapes and sizes so as to prevent false and missed detections, and improves the extraction of features of differently shaped buildings and their edge details. Compared with advanced algorithms such as BIT and ChangeFormer, the present invention achieves a higher F1 score and Kappa coefficient.

Description

Remote sensing image building change detection method based on feature enhancement network
Technical Field
The invention relates to the technical field of remote sensing image processing, and in particular to a remote sensing image building change detection method based on a feature enhancement network.
Background
Since the beginning of the twenty-first century, living standards have gradually improved and urbanization has become ever more important. Buildings, as one of the hallmarks of urban construction, reflect urban change to a great extent and are significant for urban planning and management. Change detection refers to observing differences in the state of the same geographic location at different times, and is important for building change detection, land resource utilization, post-disaster reconstruction, and the like.
Building change detection methods can be broadly classified into conventional change detection algorithms and deep-learning-based change detection algorithms. Among conventional methods, remote sensing image change detection algorithms can be roughly divided into direct comparison and post-classification comparison. Direct comparison mainly analyzes a building's geometric features, spectral textures, and so on, and obtains change information by directly comparing images. Post-classification comparison first classifies the ground features in remote sensing images from different periods and then compares the classifications to determine the final changed and unchanged areas; here the accuracy of change detection is mainly determined by the classification results.
However, as buildings become more complex and remote sensing images contain ever more object information, conventional methods increasingly fail to meet demand. Conventional building change detection methods are mostly based on hand-crafted features and are easily disturbed by factors such as noise and image registration errors. Moreover, hand-crafted features can only fit relatively simple buildings; fitting complex, abstract building features is difficult. In addition, constructing such features requires substantial expertise and experience from professionals in different fields and consumes considerable manpower and material resources, and most current approaches, based on manual field surveys, are inefficient. An automated, intelligent, and fast building change detection method is therefore increasingly needed.
With the development of space remote sensing technology, deep learning has been applied to change detection. It has strong modeling and learning capabilities and, by building a series of models (such as UNet, STANet, and the like), can perform feature extraction and end-to-end change detection on images, improving detection accuracy and speed.
However, some existing models lose building edge details, miss small-scale target buildings in complex backgrounds, detect irregularly shaped target buildings poorly, and, owing to insufficient feature extraction capability, struggle to distinguish changes between different buildings at similar positions.
Disclosure of Invention
The invention aims to provide a remote sensing image building change detection method based on a feature enhancement network that improves the feature representation capability of the network and thereby further improves change detection accuracy.
The invention adopts the following technical scheme:
a remote sensing image building change detection method based on a characteristic enhancement network specifically comprises the following steps: the first step: a dataset is prepared, a public change detection dataset CDD is collected, the dataset comprising a validation set, a training set, and a test set. Each subset contains A, B and OUT folders, which respectively correspond to the pre-change image, the post-change image and the building label actually changed, and each image has a size of 256×256 pixels.
As a further improvement scheme of the technical scheme: and a second step of: data enhancement is performed. In order to enhance the identification capability of the network to buildings under different scenes and the robustness of the network, the generalization capability of the network is enhanced, and the data enhancement is carried out on the image by adopting methods of horizontal overturning, rotation and the like.
As a further improvement scheme of the technical scheme: and a third step of: and (5) building a network model and training. The input image is first input into a feature extractor to extract building features. The feature extractor consists of three parts: a primary feature extractor, an enhanced feature extractor, a ResNet decoder.
As a further improvement scheme of the technical scheme: the primary feature extractor consists mainly of a Unet code and a visual transducer structure. Each coding block of the Unet contains two convolutional layers, each of which outputs a feature map. These two feature maps are input into the VTS to obtain a larger receptive field and enhance the ability of the feature representation, ultimately outputting five feature maps.
As a further improvement scheme of the technical scheme: the fifth output feature map is input into an enhanced feature extractor, the enhanced feature extractor is composed of four modules, namely a space and channel attention module, a U-shaped residual module, an enhanced feature extraction module and a self-attention feature fusion module, which are combined to further enhance the representation capability and the robustness of the network to the building features.
As a further improvement scheme of the technical scheme: the space and channel attention module consists of space attention and channel attention, and the attention to the feature map is increased in the channel dimension and the space dimension, so that the expression capability of the network to important building features can be effectively enhanced.
As a further improvement scheme of the technical scheme: the U-shaped residual module may better capture global and local information to enhance building feature extraction, and the enhanced feature extraction module may improve the ability to extract representative building features from the channel and space dimensions.
As a further improvement scheme of the technical scheme: the self-attention feature fusion module fully merges feature information through different operations (such as summation, subtraction and stitching).
As a further improvement scheme of the technical scheme: the outputs of the enhancement feature extractor and the primary feature extractor are input into a ResNet decoder for decoding, and finally two feature graphs are output.
As a further improvement scheme of the technical scheme: the two output feature maps of the feature extractor are input into the cross-channel up and down Wen Yuyi aggregation module for full fusion of the channel information.
As a further improvement scheme of the technical scheme: the output signature of the cross-channel up and down Wen Yuyi aggregation module is input into a convolution layer to obtain the final change detection map.
As a further improvement scheme of the technical scheme: the loss function used for network training is a combination of cross entropy loss, dice loss and Focal loss to improve the impact of imbalance between changing and unchanged buildings.
As a further improvement scheme of the technical scheme: the test set samples are input into a trained network model to predict a building change map.
According to the invention, the introduction of a visual transformer structure provides spatial correlation for building feature maps at different levels and strengthens the network's ability to recognize buildings at different positions; furthermore, the introduction of spatial and channel attention filters out interference from irrelevant background information in both the spatial and channel dimensions and improves the detection of small-scale buildings.
Furthermore, the U-shaped residual module and the enhanced feature extraction module are designed to improve the extraction of features of differently shaped buildings and their edge details. The self-attention feature fusion module is proposed to fully fuse information from different buildings, so that buildings of different shapes and sizes are better distinguished and false and missed detections are prevented.
Furthermore, a cross-channel context semantic aggregation module is designed to aggregate information in the channel dimension, making better use of contextual semantic information and reducing information loss when feature maps are merged, thereby improving the network's ability to detect buildings. Compared with advanced algorithms such as BIT and ChangeFormer, the present invention achieves a higher F1 score and Kappa coefficient.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings required by the embodiments or the description of the prior art are briefly described below. Obviously, the drawings described below are only some embodiments of the invention, and a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of a change detection flow in an embodiment of the present invention;
FIG. 2 is a schematic diagram of a network model structure according to an embodiment of the present invention;
FIG. 3 is a schematic view of the visual transformer structure according to an embodiment of the present invention;
FIG. 4 is a block diagram of a space and channel attention module in accordance with an embodiment of the present invention;
FIG. 5 is a block diagram of a u-shaped residual module and an enhanced feature extraction module in accordance with an embodiment of the present invention;
FIG. 6 is a block diagram of a self-attention feature fusion module in accordance with an embodiment of the present invention;
FIG. 7 is a block diagram of the cross-channel context semantic aggregation module in accordance with an embodiment of the present invention;
FIG. 8 is a comparison of the building change detection results of the present invention with those of other prior advanced methods.
Detailed Description
The following description of the embodiments of the present invention is made clearly and completely with reference to the accompanying drawings. It is apparent that the embodiments described are only some, not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort are intended to be within the scope of the invention.
As shown in fig. 1, 2 and 3, the present invention includes the steps of:
The first step: preparing a data set. First, the public change detection dataset CDD is collected; it comprises 11 pairs of remote sensing images with seasonal changes, of which 7 pairs are 4725×2700 pixels and 4 pairs are 1900×1000 pixels, with resolutions ranging from 0.03 m/pixel to 1 m/pixel. The images are cropped to 256×256 pixels: 10000 for training, 3000 for validation, and 3000 for testing.
And a second step: carrying out data enhancement. To strengthen the network's ability to recognize buildings in different scenes, its robustness, and its generalization, the images are augmented by methods such as horizontal flipping and rotation, as sketched below.
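As an illustration only, a minimal PyTorch-style sketch of this augmentation step follows; the helper name, the use of torchvision, and the restriction to right-angle rotations are assumptions, since the patent names only horizontal flipping and rotation. The key point is that the same random transform must be applied to the pre-change image, the post-change image, and the label so that the triplet stays aligned.

```python
import random
import torchvision.transforms.functional as TF

def augment_pair(img_a, img_b, label):
    """Apply the same random flip/rotation to the pre-change image (A),
    the post-change image (B) and the change label so they stay aligned.
    Hypothetical helper; the patent specifies only the operations used."""
    if random.random() < 0.5:                    # horizontal flip
        img_a, img_b, label = TF.hflip(img_a), TF.hflip(img_b), TF.hflip(label)
    angle = random.choice([0, 90, 180, 270])     # rotation (assumed right angles)
    if angle:
        img_a = TF.rotate(img_a, angle)
        img_b = TF.rotate(img_b, angle)
        label = TF.rotate(label, angle)
    return img_a, img_b, label
```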
And a third step: building the network model and training. The model structure of the feature enhancement network is shown in fig. 2. The input image is first fed into a feature extractor to extract building features. The feature extractor consists of three parts: a primary feature extractor, an enhanced feature extractor, and a ResNet decoder.
The primary feature extractor mainly comprises a UNet encoder and a visual transformer structure. Each encoding block of the UNet contains two convolutional layers, each of which outputs a feature map. These two feature maps are input into the VTS to obtain a larger receptive field and enhance feature representation. Finally, the primary feature extractor outputs five feature maps. The fifth feature map is input into the enhanced feature extractor, which consists of four modules, for further feature enhancement. The first four output features and the fifth, enhanced feature are input into the ResNet structure for decoding. After decoding, the feature extractor outputs two feature maps of the same size, which are input into the cross-channel context semantic aggregation module for fusion. Finally, the output change detection map is obtained through two convolutional layers with 3×3 kernels and one convolutional layer with a 1×1 kernel; the output change map has 2 channels.
Primary feature extractor
As shown in fig. 2 (b), the primary feature extractor employs a UNet encoding structure and a visual transformer structure (VTS). Each encoding block contains two convolutional layers, each of which outputs a feature map; these two feature maps are input into the VTS, whose structure is shown in fig. 3. Because the receptive field of the feature map after convolution is larger than before convolution, the second feature map contains more semantic information. After patch embedding, the first feature is used as the query vector, and the second feature is used as the key and value vectors. The number of attention heads is set to 12. Meanwhile, the first feature serves as a position matrix and is superimposed on the second feature as input. Also, to reduce network parameters, the number of transformer blocks is set to 1. After the feature map is rescaled, its size is 768×16×16. It is then input into a transposed convolution layer to change the size and channel count of the feature map; the final output feature map has the same size as the input feature map. The ResNet feature decoder mainly uses part of the ResNet18 network structure. It first doubles the size of the input feature map by a transposed convolution with a 7×7 kernel. The feature map is then input into a residual module followed by a Dropout layer to reduce overfitting. Next, it is concatenated with the feature map of the corresponding scale and input into a residual module; this process repeats until the final output feature map is obtained.
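For concreteness, a minimal PyTorch sketch of the VTS cross-attention is given below, assuming 16×16 patch embedding into a 768-dimensional space (consistent with the 768×16×16 size above), 12 heads, and a single transformer block; the projection layers and the residual/LayerNorm placement are assumptions not fixed by the text.

```python
import torch
import torch.nn as nn

class VTSBlock(nn.Module):
    """Sketch of the visual transformer structure (VTS): the first conv
    feature map supplies the query and a positional term, the second
    supplies key/value; 12 heads, one transformer block."""
    def __init__(self, in_ch, embed_dim=768, patch=16):
        super().__init__()
        self.embed_q = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)
        self.embed_kv = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=12, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)
        # transposed convolution restores the input size and channel count
        self.proj_out = nn.ConvTranspose2d(embed_dim, in_ch, kernel_size=patch, stride=patch)

    def forward(self, f1, f2):
        q = self.embed_q(f1)               # B x 768 x 16 x 16 for a 256 x 256 input
        kv = self.embed_kv(f2) + q         # first feature doubles as the position matrix
        b, c, h, w = q.shape
        q = q.flatten(2).transpose(1, 2)   # token sequences: B x N x 768
        kv = kv.flatten(2).transpose(1, 2)
        out, _ = self.attn(q, kv, kv)      # cross-attention: f1 queries f2
        out = self.norm(out + q)           # single block with a residual connection
        out = out.transpose(1, 2).reshape(b, c, h, w)
        return self.proj_out(out)          # same size as the input feature map
```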
Spatial and channel attention module
As shown in fig. 4, channel attention uses an adaptive average pooling operation to pool each channel of the input feature map. Two fully connected layers then reduce the feature parameters, and a ReLU function adds nonlinearity. The fully connected result is input into a sigmoid function to perform weight normalization, and the weights are multiplied with each element of the input feature map to obtain the channel attention feature map. Spatial attention employs average pooling, which takes global feature information into account, and max pooling, which mines the representative features of a building, using pooling kernels of different sizes: the left branch uses a 3×3 pooling kernel and the middle branch a 5×5 pooling kernel. Smaller pooling kernels capture finer target features, while larger pooling kernels mine richer target features. After pooling, a 1×1 convolutional layer adjusts the number of channels. The outputs of the different pooling operations are then concatenated to fuse information. The result is input into two convolutional layers to obtain initial weights, and the final weights are computed with a sigmoid function. Finally, the spatial attention and channel attention outputs are added to obtain the final output.
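A minimal sketch of this module follows; the channel reduction ratio, the assignment of average pooling to the 3×3 kernel and max pooling to the 5×5 kernel, and the widths of the two fusion convolutions are assumptions where fig. 4 is not explicit.

```python
import torch
import torch.nn as nn

class SpatialChannelAttention(nn.Module):
    """Sketch of the spatial and channel attention module (fig. 4)."""
    def __init__(self, ch, reduction=4):
        super().__init__()
        # channel branch: global average pool -> FC -> ReLU -> FC -> sigmoid
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(ch, ch // reduction), nn.ReLU(inplace=True),
            nn.Linear(ch // reduction, ch), nn.Sigmoid())
        # spatial branch: stride-1 pooling so the feature-map size is preserved
        self.pool3 = nn.AvgPool2d(3, stride=1, padding=1)   # finer features
        self.pool5 = nn.MaxPool2d(5, stride=1, padding=2)   # richer features
        self.conv3 = nn.Conv2d(ch, ch, 1)
        self.conv5 = nn.Conv2d(ch, ch, 1)
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.gap(x).view(b, c)).view(b, c, 1, 1)
        channel_att = x * w                                 # channel-weighted features
        s = torch.cat([self.conv3(self.pool3(x)),
                       self.conv5(self.pool5(x))], dim=1)   # fuse both pooling scales
        spatial_att = x * self.fuse(s)                      # spatially weighted features
        return channel_att + spatial_att                    # sum of the two attentions
```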
U-shaped residual module and enhanced feature extraction module
As shown in fig. 5 (a), the U-shaped residual module is divided into two parts: an upper branch and a lower branch. A max pooling layer with a 2×2 pooling kernel reduces the input feature map to half its original size. Deep semantic features are then extracted through four convolutional layers, and skip connections reuse the earlier information to reduce information loss: the outputs of the fourth and third convolutional layers are concatenated and input into a convolutional layer, and after three such operations the output is upsampled and added to the input features. The lower branch performs the same operations, except that max pooling is replaced by average pooling.
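A minimal sketch of one plausible realization follows; the convolution widths, the use of BatchNorm, and the summation of the two branches (the text does not state how they are merged) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class UResBranch(nn.Module):
    """One branch of the U-shaped residual module: pool, four encoding
    convolutions, three skip-connected decoding convolutions, upsample."""
    def __init__(self, ch, pool):
        super().__init__()
        self.pool = pool
        self.enc = nn.ModuleList([conv_bn_relu(ch, ch) for _ in range(4)])
        self.dec = nn.ModuleList([conv_bn_relu(2 * ch, ch) for _ in range(3)])

    def forward(self, x):
        y = self.pool(x)                     # halve the spatial size
        feats = []
        for conv in self.enc:                # four encoding convolutions
            y = conv(y)
            feats.append(y)
        for i, conv in enumerate(self.dec):  # concat with encoder outputs 3, 2, 1
            y = conv(torch.cat([y, feats[2 - i]], dim=1))
        y = F.interpolate(y, scale_factor=2, mode='bilinear', align_corners=False)
        return x + y                         # residual connection to the input

class UShapedResidual(nn.Module):
    """Sketch of the U-shaped residual module (fig. 5(a)): an upper branch
    with max pooling and a lower branch with average pooling."""
    def __init__(self, ch):
        super().__init__()
        self.upper = UResBranch(ch, nn.MaxPool2d(2))
        self.lower = UResBranch(ch, nn.AvgPool2d(2))

    def forward(self, x):
        return self.upper(x) + self.lower(x)  # branch merge assumed to be a sum
```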
As shown in fig. 5 (b), the enhanced feature extraction module is divided into a left branch and a right branch. The left branch mainly extracts feature information in the spatial dimension; the right branch mainly mines features in the channel dimension. The left branch first compresses the channels of the input feature to 1 with a 1×1 convolution, then extracts spatial features through two convolutional layers. The result is fed to a sigmoid function for weight normalization, the weights are multiplied with the input features so that each feature element is assigned a weight, and the left branch output is finally obtained through addition. In the right branch, the left side uses an average pooling operation to pool each channel of the input feature, while the right side uses a max pooling operation, so that feature information is considered from both global and local perspectives. Two 1×1 convolutional layers then compress and expand the channels, and the result is activated by a sigmoid function. After the activation result is multiplied with the input feature, the left and right features are added to obtain the final output.
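A minimal sketch follows; the kernel sizes of the two spatial convolutions, the channel reduction ratio, and the summing of the two pooled descriptors before the sigmoid are assumptions where fig. 5(b) is not explicit.

```python
import torch
import torch.nn as nn

class EnhancedFeatureExtraction(nn.Module):
    """Sketch of the enhanced feature extraction module (fig. 5(b)):
    a spatial (left) branch and a channel (right) branch."""
    def __init__(self, ch, reduction=4):
        super().__init__()
        # left branch: compress to a single-channel spatial weight map
        self.spatial = nn.Sequential(
            nn.Conv2d(ch, 1, 1),
            nn.Conv2d(1, 1, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(1, 1, 3, padding=1), nn.Sigmoid())
        # right branch: average- and max-pooled channel statistics
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.gmp = nn.AdaptiveMaxPool2d(1)
        self.mlp = nn.Sequential(                      # 1x1 compress then expand
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1))

    def forward(self, x):
        left = x + x * self.spatial(x)                 # spatial weighting plus addition
        w = torch.sigmoid(self.mlp(self.gap(x)) + self.mlp(self.gmp(x)))
        right = x * w                                  # channel weighting
        return left + right                            # add the two branch outputs
```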
Self-attention feature fusion module
As shown in fig. 6, to better fuse feature information, three operations are used: addition, subtraction, and concatenation. Two convolutional layers then obtain deep representative features, and a 1×1 convolution adjusts the number of channels. Self-attention can correlate pixels at different locations, which identifies buildings well. Accordingly, the subtraction branch is taken as the query vector, the addition branch as the key vector, and the concatenation branch as the value vector. The subtraction output undergoes reshape and transpose operations, and the addition output undergoes a reshape. They are then multiplied, the product is activated with a sigmoid function, and the feature information of the concatenation branch is fused using two convolutional layers. The result is then multiplied by the final weight matrix to build long-range spatial dependencies, and the product is reshaped. Finally, this result and the two input features are added element by element, so that the final output is a weighted sum of the input features and all position features.
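A minimal sketch of this fusion follows; the widths of the convolution pairs and the exact operation order around the value branch are assumptions where fig. 6 is not explicit. Note that the N×N affinity matrix makes this practical only on small feature maps such as the 16×16 maps above.

```python
import torch
import torch.nn as nn

class SelfAttentionFusion(nn.Module):
    """Sketch of the self-attention feature fusion module (fig. 6): the
    difference of the two inputs is the query, their sum the key, and
    their concatenation the value."""
    def __init__(self, ch):
        super().__init__()
        self.q = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.Conv2d(ch, ch, 1))
        self.k = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.Conv2d(ch, ch, 1))
        self.v = nn.Sequential(nn.Conv2d(2 * ch, ch, 3, padding=1), nn.Conv2d(ch, ch, 1))

    def forward(self, f1, f2):
        b, c, h, w = f1.shape
        q = self.q(f1 - f2).flatten(2)                      # B x C x N, subtraction branch
        k = self.k(f1 + f2).flatten(2)                      # B x C x N, addition branch
        v = self.v(torch.cat([f1, f2], dim=1)).flatten(2)   # concatenation branch
        attn = torch.sigmoid(torch.bmm(q.transpose(1, 2), k))  # B x N x N affinity
        out = torch.bmm(v, attn).reshape(b, c, h, w)        # weight the value branch
        return out + f1 + f2                                # weighted sum with both inputs
```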
Cross-channel context semantic aggregation module
As shown in fig. 7, in this module the middle feature map is obtained by channel-wise concatenation of the left and right feature maps. The middle feature is then compressed to 1×1 by an adaptive average pooling layer, its channel count is adjusted by a 1×1 convolutional layer, and the convolution output is concatenated with the middle feature map. The concatenation result is passed to a 1×1 convolutional layer, whose output is fed to the right branch for connection and fusion; a sigmoid function normalizes the weight matrix, so that the left and right branches fully aggregate and fuse channel information. Two convolutional layers extract multi-scale features from the left and right branch features. Finally, the weight matrix is multiplied by the convolution result and added to it element by element, so that the channel information of the output feature maps is well intermixed.
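Fig. 7 is only partly specified in the text, so the sketch below is an approximation: the broadcast of the pooled channel descriptor, the kernel sizes of the multi-scale convolutions, and the final weighting are assumptions.

```python
import torch
import torch.nn as nn

class CrossChannelContextAggregation(nn.Module):
    """Sketch of the cross-channel context semantic aggregation module
    (fig. 7) operating on the two decoder output feature maps."""
    def __init__(self, ch):
        super().__init__()
        self.squeeze = nn.Sequential(nn.AdaptiveAvgPool2d(1),   # 1x1 descriptor
                                     nn.Conv2d(2 * ch, 2 * ch, 1))
        self.mix = nn.Conv2d(4 * ch, 2 * ch, 1)  # fuse descriptor with middle map
        self.multi = nn.Sequential(              # multi-scale feature extraction
            nn.Conv2d(2 * ch, 2 * ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(2 * ch, 2 * ch, 3, padding=2, dilation=2))

    def forward(self, left, right):
        mid = torch.cat([left, right], dim=1)          # channel concatenation
        d = self.squeeze(mid).expand_as(mid)           # pooled channel descriptor
        w = torch.sigmoid(self.mix(torch.cat([mid, d], dim=1)))  # channel weights
        feats = self.multi(mid)
        return w * feats + feats                       # weighted, then element-wise sum
```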
The loss function used in training the network is a combination of cross entropy loss, Dice loss, and Focal loss, as follows, mitigating the impact of the imbalance between changed and unchanged buildings.
L_bce = -(1/(H×W)) Σ_{n=1}^{H×W} [y_n log(p_n) + (1 - y_n) log(1 - p_n)]    (1)
L_dc = 1 - 2Σ_n y_n p_n / (Σ_n y_n + Σ_n p_n)    (2)
L_fc = -(1 - p)^α log(p)    (3)
L = L_bce + δL_dc + φL_fc    (4)
where y_n represents the true surface change, p_n represents the predicted building change, and H and W represent the height and width of the image, respectively. α is a hyperparameter with α ≥ 0, and p is the probability estimated by the model, with values in the range [0, 1]. δ and φ are used to balance the losses L_bce, L_dc and L_fc.
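A minimal sketch of the combined loss under the reconstructed forms of equations (1)-(3) follows; the default values of α, δ, and φ are assumptions, since the patent does not state them.

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, target, alpha=2.0, delta=1.0, phi=1.0, eps=1e-6):
    """Sketch of L = L_bce + delta*L_dc + phi*L_fc (eqs. 1-4).
    logits: B x 2 x H x W change map, target: B x H x W integer labels."""
    # (1) cross entropy over the two-channel change map
    l_bce = F.cross_entropy(logits, target)
    # (2) Dice loss on the probability of the "changed" class
    p = torch.softmax(logits, dim=1)[:, 1]
    y = target.float()
    l_dc = 1 - (2 * (p * y).sum() + eps) / (p.sum() + y.sum() + eps)
    # (3) Focal loss: down-weight easy pixels by (1 - p_t) ** alpha
    p_t = torch.where(target == 1, p, 1 - p).clamp(eps, 1 - eps)
    l_fc = (-(1 - p_t) ** alpha * torch.log(p_t)).mean()
    return l_bce + delta * l_dc + phi * l_fc
```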
Fourth step: detecting building changes. After the network training has finished and converged, the test set sample images are passed through the network to output change detection maps.
To verify the effectiveness of the invention, different algorithm models were trained and tested on the CDD dataset under the same environment. The algorithms used for comparison were STANet, SNUNet, BIT, ChangeFormer, and IDET. Five evaluation indices were used: Overall Accuracy, Precision, Recall, F1-score, and the Kappa coefficient; F1 is the harmonic mean of precision and recall, and larger values are better. The specific evaluation results are shown in Table 1.
Table 1: evaluation results of the compared methods (presented as an image in the original publication).
As can be seen from Table 1, the method of the invention surpasses the 5 existing advanced methods on all metrics, which demonstrates its effectiveness.
FIG. 8 compares the building change detection results of the method of the present invention with those of other prior methods.
As can be seen from fig. 8, the proposed model produces more complete and accurate change detection results. It distinguishes well between irregularly shaped target buildings and between different buildings at similar positions, filters out the influence of background noise, and strengthens the detection of small-scale target buildings.
The present invention has been described in detail with reference to the drawings and the embodiments, but the present invention is not limited to these embodiments; various changes can be made, within the knowledge of those skilled in the art and without departing from the spirit of the present invention, based on the technical matters disclosed above. The invention may also be practiced otherwise than as specifically described.
In the description of the present invention, it should be noted that orientation terms such as "center", "lateral", "longitudinal", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", and "counterclockwise" indicate orientations or positional relationships based on those shown in the drawings. They are used merely for convenience in describing the present invention and simplifying the description, and are not to be construed as requiring that the device or element referred to have a specific orientation or be constructed and operated in a specific orientation, nor as limiting the scope of protection of the present invention.
It should be noted that the terms "comprises" and "comprising", along with any variations thereof, in the description and claims of the present application are intended to cover non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Note that the above is only a preferred embodiment of the present invention and the technical principles it uses. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements, and substitutions can be made without departing from the scope of the invention. Therefore, while the present invention has been described in connection with the above embodiments, the invention is not limited to them; many other equally effective embodiments may be devised without departing from the spirit of the invention, and its scope is determined by the scope of the appended claims.

Claims (10)

1. A remote sensing image building change detection method based on a feature enhancement network, characterized by comprising the following steps:
the first step: preparing a data set: the public change detection dataset CDD is collected; it comprises a validation set, a training set, and a test set, each subset containing A, B, and OUT folders that correspond respectively to the pre-change image, the post-change image, and the ground-truth building change label, each image being 256×256 pixels;
the second step: performing data enhancement: to strengthen the network's ability to recognize buildings in different scenes, its robustness, and its generalization, the images are augmented by methods such as horizontal flipping and rotation;
the third step: building and training the network model, the network model comprising:
a feature extractor: the input image is first input into the feature extractor to extract building features;
a cross-channel context semantic aggregation module: for full fusion of channel information;
the fourth step: building change detection: the test set samples are input into the trained network model to predict the building change map.
2. The remote sensing image building change detection method based on a feature enhancement network according to claim 1, wherein the feature extractor comprises:
a primary feature extractor for enhancing feature expression capability, consisting mainly of a UNet encoder and a visual transformer structure (VTS), wherein each encoding block of the UNet comprises two convolutional layers, each of which outputs a feature map, and these two feature maps are input into the VTS to obtain a larger receptive field and enhance feature representation, five feature maps ultimately being output; specifically:
after patch embedding, the first feature is used as the query vector;
the second feature is used as the key and value vectors;
the number of attention heads is set to 12; meanwhile, the first feature is used as a position matrix and is superimposed on the second feature as input; the number of transformer blocks is set to 1;
after the feature map is rescaled, its size is 768×16×16; it is then input into a transposed convolution layer to change the size and channel count of the feature map, and the finally output feature map has the same size as the input feature map;
an enhanced feature extractor, which further enhances feature expression capability;
and a ResNet decoder, into which the outputs of the enhanced feature extractor and the primary feature extractor are input for decoding, two feature maps finally being output.
3. The remote sensing image building change detection method based on a feature enhancement network according to claim 2, wherein the fifth output feature map is input into the enhanced feature extractor to further enhance feature expression capability; the enhanced feature extractor is composed of four modules, namely a spatial and channel attention module, a U-shaped residual module, an enhanced feature extraction module, and a self-attention feature fusion module, which together further enhance the network's representation of building features and its robustness; the spatial and channel attention module consists of spatial attention and channel attention, and increasing attention to the feature map in both the channel and spatial dimensions effectively strengthens the network's expression of important building features; the U-shaped residual module enhances the extraction of building features and better captures global and local information; the enhanced feature extraction module improves the extraction of representative building features in the channel and spatial dimensions; and the self-attention feature fusion module fully merges feature information through summation, subtraction, and concatenation.
4. The remote sensing image building change detection method based on a feature enhancement network according to claim 3, wherein the ResNet feature decoder mainly uses part of the ResNet18 network structure: it first doubles the size of the input feature map by a transposed convolution with a 7×7 kernel; the feature map is then input into a residual module followed by a Dropout layer to reduce overfitting; next, it is concatenated with the feature map of the corresponding scale and input into a residual module, and this process repeats until the final output feature map is obtained.
5. The remote sensing image building change detection method based on a feature enhancement network according to claim 3, wherein the spatial and channel attention module operates as follows:
channel attention uses an adaptive average pooling operation to pool each channel of the input feature map;
two fully connected layers are then used to reduce the feature parameters, and a ReLU function is used to add nonlinearity;
the fully connected result is input into a sigmoid function to perform weight normalization, and the weights are multiplied with each element of the input feature map to obtain the channel attention feature map.
6. The remote sensing image building change detection method based on a feature enhancement network according to claim 3, wherein the loss function used in the network training in the third step is a combination of cross entropy loss, Dice loss, and Focal loss, used to mitigate the impact of the imbalance between changed and unchanged buildings, specifically given by the following formulas:
L_bce = -(1/(H×W)) Σ_{n=1}^{H×W} [y_n log(p_n) + (1 - y_n) log(1 - p_n)]    (1)
L_dc = 1 - 2Σ_n y_n p_n / (Σ_n y_n + Σ_n p_n)    (2)
L_fc = -(1 - p)^α log(p)    (3)
L = L_bce + δL_dc + φL_fc    (4)
where y_n represents the true surface change, p_n represents the predicted building change, and H and W represent the height and width of the image, respectively; α is a hyperparameter with α ≥ 0, and p is the probability estimated by the model, with values in the range [0, 1]; δ and φ are used to balance the losses; L_bce is the cross entropy loss function, and L_dc and L_fc are the Dice loss function and the Focal loss function, respectively.
7. The remote sensing image building change detection method based on a feature enhancement network according to claim 3, wherein the U-shaped residual module is divided into two parts, an upper branch and a lower branch, applied as follows:
a max pooling layer with a 2×2 pooling kernel reduces the input feature map to half its original size;
deep semantic features are then extracted through four convolutional layers, with skip connections reusing the earlier information to reduce information loss;
the outputs of the fourth and third convolutional layers are concatenated and input into a convolutional layer;
after three such operations, the output is upsampled and then added to the input features;
the lower branch operates the same as the upper branch, except that max pooling is replaced by average pooling.
8. The remote sensing image building change detection method based on a feature enhancement network according to claim 3, wherein the enhanced feature extraction module is divided into a left branch and a right branch; the left branch mainly extracts feature information in the spatial dimension, and the right branch mainly mines features in the channel dimension; the left branch first compresses the channels of the input feature to 1 with a 1×1 convolution, then extracts spatial features through two convolutional layers; the result is fed to a sigmoid function for weight normalization; the weights are multiplied with the input features, assigning a weight to each feature element; the left branch output is finally obtained through addition; in the right branch, the left side uses an average pooling operation to pool each channel of the input feature, and the right side uses a max pooling operation, so that feature information is considered from both global and local perspectives; two 1×1 convolutional layers then compress and expand the channels, and the result is activated by a sigmoid function; after the activation result is multiplied with the input feature, the left and right features are added to obtain the final output.
9. The remote sensing image building change detection method based on a feature enhancement network according to claim 3, wherein, when the number of channels is adjusted by a 1×1 convolution, the self-attention mechanism can correlate pixels at different positions and thus identify buildings well; accordingly, the self-attention feature fusion module works as follows: the subtraction branch is taken as the query vector, the addition branch as the key vector, and the concatenation branch as the value vector; the subtraction output undergoes reshape and transpose operations, and the addition output undergoes a reshape; they are then multiplied, the product is activated with a sigmoid function, and the feature information of the concatenation branch is fused using two convolutional layers; the result is then multiplied by the final weight matrix to construct long-range spatial dependencies, and the product is reshaped; finally, this result and the two input features are added element by element, and the final output is considered a weighted sum of the input features and all position features.
10. The remote sensing image building change detection method based on a feature enhancement network according to claim 1, wherein, in the cross-channel context semantic aggregation module, the middle feature map is obtained by channel-wise concatenation of the left and right features; the middle feature is compressed to 1×1 by an adaptive average pooling layer and its channel count is adjusted by a 1×1 convolutional layer; the convolution output is concatenated with the middle feature map, the concatenation result is passed to a 1×1 convolutional layer, and the convolution result is fed to the right branch for connection and fusion; a sigmoid function normalizes the weight matrix, and the left and right branches fully realize the aggregation and fusion of channel information; two convolutional layers extract multi-scale features from the left and right branch features; finally, the weight matrix is multiplied by the convolution result and added to it element by element, so that the channel information of the output feature maps is well intermixed.
CN202310426990.4A 2023-04-20 2023-04-20 Remote sensing image building change detection method based on feature enhancement network Pending CN116310839A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310426990.4A CN116310839A (en) 2023-04-20 2023-04-20 Remote sensing image building change detection method based on feature enhancement network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310426990.4A CN116310839A (en) 2023-04-20 2023-04-20 Remote sensing image building change detection method based on feature enhancement network

Publications (1)

Publication Number Publication Date
CN116310839A 2023-06-23

Family

ID=86787251

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310426990.4A Pending CN116310839A (en) 2023-04-20 2023-04-20 Remote sensing image building change detection method based on feature enhancement network

Country Status (1)

Country Link
CN (1) CN116310839A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117152621A (en) * 2023-10-30 2023-12-01 中国科学院空天信息创新研究院 Building change detection method, device, electronic equipment and storage medium
CN117152621B (en) * 2023-10-30 2024-02-23 中国科学院空天信息创新研究院 Building change detection method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111539316B (en) High-resolution remote sensing image change detection method based on dual-attention twin network
CN110705457B (en) Remote sensing image building change detection method
CN111080629B (en) Method for detecting image splicing tampering
CN112668494A (en) Small sample change detection method based on multi-scale feature extraction
CN111738110A (en) Remote sensing image vehicle target detection method based on multi-scale attention mechanism
CN113569788B (en) Building semantic segmentation network model training method, system and application method
Yin et al. Attention-guided siamese networks for change detection in high resolution remote sensing images
CN103745453B (en) Urban residential areas method based on Google Earth remote sensing image
CN114187520B (en) Building extraction model construction and application method
CN115601661A (en) Building change detection method for urban dynamic monitoring
CN116310839A (en) Remote sensing image building change detection method based on feature enhancement network
CN113569724A (en) Road extraction method and system based on attention mechanism and dilation convolution
CN114494821A (en) Remote sensing image cloud detection method based on feature multi-scale perception and self-adaptive aggregation
CN111275694B (en) Attention mechanism guided progressive human body division analysis system and method
CN114494870A (en) Double-time-phase remote sensing image change detection method, model construction method and device
CN115376019A (en) Object level change detection method for heterogeneous remote sensing image
CN117475236B (en) Data processing system and method for mineral resource exploration
CN112818818B (en) Novel ultra-high-definition remote sensing image change detection method based on AFFPN
CN113298689B (en) Large-capacity image steganography method
CN117173573A (en) Urban building type change remote sensing detection method
CN114463175B (en) Mars image super-resolution method based on deep convolutional neural network
CN115527118A (en) Remote sensing image target detection method fused with attention mechanism
CN115147727A (en) Method and system for extracting impervious surface of remote sensing image
CN116958800A (en) Remote sensing image change detection method based on hierarchical attention residual unet++
CN113963271A (en) Model for identifying impervious surface from remote sensing image and method for training model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination