CN114821342A - Remote sensing image road extraction method and system

Info

Publication number: CN114821342A
Application number: CN202210623810.7A
Authority: CN (China)
Prior art keywords: road, remote sensing, model, information, sensing image
Legal status: Granted
Other languages: Chinese (zh)
Other versions: CN114821342B (granted publication)
Inventors: 王勇 (Wang Yong), 曾祥强 (Zeng Xiangqiang)
Assignee (current and original): Institute of Geographic Sciences and Natural Resources of CAS
Application filed by: Institute of Geographic Sciences and Natural Resources of CAS
Priority: CN202210623810.7A
Events: publication of CN114821342A; application granted; publication of CN114821342B
Legal status: Active

Classifications

    • G06V 20/182: Scenes; scene-specific elements; terrestrial scenes; network patterns, e.g. roads or rivers
    • G06N 3/045: Computing arrangements based on biological models; neural networks; architectures; combinations of networks
    • G06N 3/08: Neural networks; learning methods
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. edges, contours, loops, corners, strokes or intersections; connectivity analysis
    • G06V 10/806: Fusion, i.e. combining data from various sources, of extracted features
    • G06V 10/82: Image or video recognition or understanding using neural networks
    • Y02T 10/40: Engine management systems (cross-sectional climate-change mitigation tagging)


Abstract

The invention discloses a method and system for extracting roads from remote sensing images, relating to the field of image processing. The method comprises: acquiring a remote sensing image to be interpreted; and inputting the remote sensing image to be interpreted into a road extraction model to extract its road information. The road extraction model is determined as follows: an SGF-Net model is constructed; the SGF-Net model is trained to obtain an initial road extraction model; and the initial road extraction model is compressed with a knowledge distillation strategy to obtain the road extraction model. The SGF-Net model comprises an encoder, a global information perception network and a decoder connected in sequence; the encoder is also connected to the decoder by skip connections; and the encoder is built on a spatial attention mechanism. The method extracts road information from remote sensing images quickly and accurately.

Description

Remote sensing image road extraction method and system
Technical Field
The invention relates to the field of image processing, and in particular to a method and system for extracting roads from remote sensing images.
Background
With the rapid development and wide application of remote sensing satellite and unmanned aerial vehicle (UAV) technology, extracting road information from high-resolution remote sensing images has become an important research topic. However, conventional road extraction by visual interpretation is time-consuming, labor-intensive and poorly automated. A fast, automatic and high-performance road extraction method for high-resolution remote sensing images can therefore effectively improve both the efficiency of road information acquisition and the extraction accuracy.
Most traditional road extraction methods obtain road information from high-resolution remote sensing images using pixel-based or object-oriented approaches. Pixel-based methods rely mainly on the different spectral responses in the image and on hand-constructed features such as road geometry; they can only handle road extraction against simple backgrounds, and when the background information is more complex the results often suffer from severe salt-and-pepper noise. Object-oriented methods first segment the remote sensing image into classification units of different sizes and then extract road information, which gives good noise resistance and applicability; however, owing to building-shadow occlusion and confusion with similar objects, their extraction results often show breakage or adhesion. Moreover, for remote sensing images with rich spectral information, traditional methods struggle to cope with diverse and highly complex road characteristics and cannot support large-scale, high-precision road extraction, so the extraction effect needs further improvement.
In recent years, with the development of deep learning (DL), the classification, identification and feature extraction of different ground objects using convolutional neural networks (CNNs) have improved greatly, and deep learning has very broad application prospects. A CNN can autonomously learn spectral, geometric and shape features of ground objects from the input remote sensing image, overcoming the drawback of hand-crafted features in traditional methods, and is therefore widely used in road extraction tasks. In particular, end-to-end CNNs encode the input data with hierarchical encoders to learn and extract road semantic features, with the notable advantage that shallow features are acquired automatically; a decoder then progressively decodes the acquired semantic features to restore the spatial resolution of the deep features. To overcome the difficulty of recovering spatial detail in the decoder's deep features, skip connections are introduced to fuse features across levels, so that shallow features rich in spatial detail are fully exploited to produce deep features with more refined semantic information, achieving better results in road extraction. End-to-end CNN road extraction has therefore become a new research hotspot.
Although end-to-end CNN road extraction methods have outstanding advantages in distinguishing roads from other ground objects, previous road extraction work still has the following problems. (1) Because of the complexity and diversity of roads in high-resolution remote sensing images, plain convolution operators cannot attend sharply to the spatial relationships between feature points, nor fully perceive the global semantic information of road features, leading to poor extraction results. (2) Skip connections in end-to-end CNNs increase the utilization of shallow spatial detail, but because of the large semantic gap between shallow and deep features, simply fusing them by conventional channel stacking ignores the redundancy between features of different levels, limiting the effective propagation of beneficial spatial and semantic information between encoder and decoder. (3) As the number of convolutional layers grows, recognition performance and extraction accuracy improve to some extent, but a CNN with too many layers has a huge number of network parameters, high computational complexity and high time cost, which severely slows model inference and hinders large-scale road extraction from high-resolution remote sensing images.
Disclosure of Invention
Based on this, embodiments of the present invention provide a method and system for extracting roads from remote sensing images, so as to extract road information from remote sensing images quickly and accurately.
In order to achieve the purpose, the invention provides the following scheme:
a remote sensing image road extraction method comprises the following steps:
acquiring a remote sensing image to be interpreted;
inputting the remote sensing image to be interpreted into a road extraction model, and extracting road information of the remote sensing image to be interpreted;
the determination method of the road extraction model comprises the following steps:
constructing an SGF-Net model, wherein the SGF-Net model comprises an encoder, a global information perception network and a decoder connected in sequence; the encoder is also connected to the decoder by skip connections; the encoder is constructed on a spatial attention mechanism; the encoder is used for extracting shallow features of the input remote sensing image to obtain a first feature map; the global information perception network is used for extracting deep features from the first feature map to obtain a second feature map; the decoder is used for decoding and fusing the shallow features of the first feature map and the deep features of the second feature map to obtain the road information of the input remote sensing image; the shallow features comprise road spectral information and road geometry information; and the deep features comprise road semantic information;
training the SGF-Net model with remote sensing training images and the corresponding road information, and determining the trained SGF-Net model as an initial road extraction model; and
compressing the initial road extraction model with a knowledge distillation strategy to obtain the road extraction model.
The invention also provides a remote sensing image road extraction system, which comprises:
the data acquisition module is used for acquiring a remote sensing image to be interpreted;
the road extraction module is used for inputting the remote sensing image to be interpreted into a road extraction model and extracting road information of the remote sensing image to be interpreted;
the road extraction module specifically comprises: a model determination submodule;
the model determining submodule is used for determining a road extraction model; the model determination submodule specifically includes:
the model building unit is used for constructing an SGF-Net model, wherein the SGF-Net model comprises an encoder, a global information perception network and a decoder connected in sequence; the encoder is also connected to the decoder by skip connections; the encoder is constructed on a spatial attention mechanism; the encoder is used for extracting shallow features of the input remote sensing image to obtain a first feature map; the global information perception network is used for extracting deep features from the first feature map to obtain a second feature map; the decoder is used for decoding and fusing the shallow features of the first feature map and the deep features of the second feature map to obtain the road information of the input remote sensing image; the shallow features comprise road spectral information and road geometry information; and the deep features comprise road semantic information;
the model training unit is used for training the SGF-Net model with remote sensing training images and the corresponding road information, and for determining the trained SGF-Net model as an initial road extraction model; and
the model compression unit is used for compressing the initial road extraction model with a knowledge distillation strategy to obtain the road extraction model.
Compared with the prior art, the invention has the following beneficial effects:
The embodiments of the invention provide a method and system for extracting roads from remote sensing images. An SGF-Net model is constructed using a spatial attention mechanism and a global information perception network, which improves the spatial information expression of shallow features and obtains deep features containing road semantic information; the decoder in the SGF-Net model effectively fuses the spatial information of the shallow features with the semantic information of the deep features, so that the initial road extraction model obtained by training the SGF-Net model can accurately extract road information from remote sensing images. The initial road extraction model is then compressed with a knowledge distillation strategy to obtain the road extraction model, reducing its network parameters and computational complexity; the resulting road extraction model can therefore acquire road information from remote sensing images quickly and accurately.
Drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a schematic diagram of a general concept of a remote sensing image road extraction method provided by an embodiment of the present invention;
fig. 2 is a flowchart of a method for extracting a remote sensing image road according to an embodiment of the present invention;
fig. 3 is a flowchart of a method for determining a road extraction model according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an SGF-Net model according to an embodiment of the present invention;
fig. 5 is a structural diagram of the basic residual learning unit of a coding block according to an embodiment of the present invention;
FIG. 6 is a block diagram of a spatial attention mechanism layer provided by an embodiment of the present invention;
fig. 7 is a structural diagram of a global information-aware network according to an embodiment of the present invention;
FIG. 8 is a block diagram of a decoding block according to an embodiment of the present invention;
FIG. 9 is a block diagram of a feature fusion module provided in an embodiment of the present invention;
FIG. 10 is a schematic diagram illustrating a multi-level knowledge distillation learning strategy provided by an embodiment of the present invention;
FIG. 11 is a schematic diagram of an example of a road data set provided by an embodiment of the invention;
fig. 12 is a schematic diagram of the extraction results of five classical convolutional neural network models on the DeepGlobe road dataset according to an embodiment of the present invention;
FIG. 13 is a schematic diagram of road extraction results of five models in the Massachusetts test set according to the embodiment of the present invention;
fig. 14 is a schematic diagram of the road extraction results of the five road extraction models on the Jingjin New City test set according to an embodiment of the present invention;
fig. 15 is a schematic diagram of a road extraction result according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
The general concept of the remote sensing image road extraction method provided by this embodiment is shown in fig. 1. First, a large number of remote sensing image samples (remote sensing training images) are used to train the SGF-Net road extraction model, and the accuracy and effect of road extraction are objectively evaluated on a test dataset; then the network parameters of the trained SGF-Net model (the initial road extraction model) are reduced using the constructed multi-level knowledge distillation learning strategy, so that road information in remote sensing images to be interpreted can be extracted quickly and accurately. The method is described in detail below.
Referring to fig. 2, the method of the present embodiment includes:
step S1: and acquiring the remote sensing image to be interpreted.
Step S2: and inputting the remote sensing image to be interpreted into a road extraction model, and extracting road information of the remote sensing image to be interpreted.
Referring to fig. 3, the determination method of the road extraction model includes:
step 201: constructing an SGF-Net model; the SGF-Net model comprises an encoder, a global information sensing network and a decoder which are connected in sequence; the encoder is connected with the decoder; the encoder is constructed based on a spatial attention mechanism. The encoder is used for extracting shallow features of the input remote sensing image to obtain a first feature map; the global information perception network is used for extracting deep features of the first feature map to obtain a second feature map; the decoder is used for decoding and fusing the shallow features in the first feature map and the deep features in the second feature map to obtain road information of the input remote sensing image; the shallow feature comprises road spectrum information and road geometric shape information; the deep features include road semantic information.
Step 202: training the SGF-Net model with remote sensing training images and the corresponding road information, and determining the trained SGF-Net model as the initial road extraction model.
Step 203: compressing the initial road extraction model with a knowledge distillation strategy to obtain the road extraction model.
Referring to fig. 4, the following describes each network structure in the SGF-Net model in step 201 in detail.
(1) Encoder
Referring to fig. 4, the encoder specifically includes: a plurality of sequentially connected coding units. The coding unit comprises a coding block and a spatial attention mechanism layer which are connected in sequence.
By applying several coding blocks and spatial attention mechanism layers, the encoder autonomously learns latent information of roads in the remote sensing image, such as spectral information and road geometry, and generates shallow features at different levels. The global information perception network uses dilated convolutions and a self-attention unit to perceive feature regions of different ranges and the long-distance relations between feature points, effectively aggregating the contextual (semantic) information of road features. The decoder, on the one hand, makes full use of the spatial information of the shallow features and the semantic information of the deep features through the feature fusion module; on the other hand, it progressively restores the spatial resolution of the road features with the decoding blocks and finally outputs a binary road extraction map. Specifically:
the encoder adopts ResNet-34 as a backbone extraction network for road shallow feature extraction, and utilizes a spatial attention unit to further highlight potential information such as spectrum, texture and shape of a road in the shallow feature. To avoid excessive downsampling leading to a large loss of spatial detail information, the step size convolution and pooling layers of the ResNet-34 initial layer are removed, leaving the encoded blocks in the remaining four stages. Therefore, the encoder in the present embodiment includes four sequentially connected encoding units, i.e., has four encoding blocks.
① Coding block
The coding block is constructed on the basis of the ResNet-34 network and encodes the input features to obtain the coding features.
Fig. 5 shows the basic residual learning unit of a coding block. It extracts road feature information with two consecutive 3 × 3 convolutions and learns the difference between input and output through an identity mapping mechanism, which not only extracts shallow road features with relatively rich semantic information but also preserves the integrity of the input information.
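For illustration, the residual learning unit of fig. 5 can be sketched in PyTorch (the framework used for the experiments below); the use of batch normalization and the uniform channel count are assumptions following standard ResNet-34 practice rather than details fixed by this description.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Basic residual learning unit: two consecutive 3x3 convolutions
    whose output is added to the input via an identity mapping (fig. 5)."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # learn the difference between input and output; the identity
        # branch preserves the integrity of the input information
        return self.act(self.body(x) + x)
```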
Although the residual learning unit reduces the training difficulty of the model and learns shallow features well, the shallow features cannot accurately express the spatial information of road features because of the complexity of roads in remote sensing images, the interference of background information and the purely local perception of standard convolution. Therefore, a spatial attention mechanism layer, shown in fig. 6, is used to learn the spatial distribution relations between feature points and to highlight the expression of road features.
② Spatial attention mechanism layer
Referring to fig. 6, the spatial attention mechanism layer comprises a pooling layer and a first convolutional layer of size 7 × 7 connected in sequence. The pooling layer performs max pooling and average pooling on the input coding features to obtain the distribution information of the road features in the spatial dimension; the first convolutional layer determines the spatial distribution relations of the road features from this distribution information and assigns a weight to each feature point to obtain a spatial attention feature map; the spatial attention feature map is multiplied point-wise with the coding features, and the product is added to the coding features to obtain the coding output features. Specifically, the spatial attention mechanism layer first applies average pooling and max pooling to the input x ∈ R^{C×H×W} along the channel dimension to obtain the spatial distribution information of the road features. A 7 × 7 convolution with a Sigmoid activation function then autonomously learns the spatial distribution relations of the road features and optimally assigns a weight to each feature point to obtain the spatial attention feature map; finally, after matrix point multiplication and feature addition with the input feature x, the coding output feature y ∈ R^{C×H×W} is obtained, which highlights the expression of road features at the spatial level and suppresses the influence of irrelevant information.
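A minimal PyTorch sketch of this spatial attention mechanism layer follows; it assumes that the channel-wise average and max pooling results are concatenated before the 7 × 7 convolution, which fig. 6 suggests but the text does not state explicitly.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention mechanism layer (fig. 6, sketch)."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = x.mean(dim=1, keepdim=True)       # average pooling along the channel dimension
        mx, _ = x.max(dim=1, keepdim=True)      # max pooling along the channel dimension
        s = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # attention map (B,1,H,W)
        return s * x + x                        # point-wise product, then added back to x
```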
The coding block input of the first coding unit in the encoder is the input remote sensing image; the coding output of the last coding unit is the first feature map.
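Assuming the four stages of torchvision's ResNet-34 are reused directly as the coding blocks (an implementation detail not spelled out here), the encoder could be assembled roughly as follows, reusing the SpatialAttention sketch above.

```python
import torch.nn as nn
from torchvision.models import resnet34  # torchvision >= 0.13 API; older versions use pretrained=False

class Encoder(nn.Module):
    """Four coding units (ResNet-34 stage followed by spatial attention);
    the strided initial convolution and pooling of ResNet-34 are dropped."""
    def __init__(self):
        super().__init__()
        backbone = resnet34(weights=None)
        self.stages = nn.ModuleList(
            [backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4]
        )
        self.attn = nn.ModuleList([SpatialAttention() for _ in range(4)])

    def forward(self, x):                # x: output of the initial block (64 channels)
        feats = []
        for stage, attention in zip(self.stages, self.attn):
            x = attention(stage(x))
            feats.append(x)              # shallow features kept for the skip connections
        return feats                     # feats[-1] is the first feature map
```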
The first feature map produced by the encoder condenses the semantic information of roads in a highly abstract way, but roads in remote sensing images often exhibit scale variation, strong connectivity and similar characteristics, leading to inconsistent extraction scales and weak global perception. This embodiment therefore uses a global information perception network to capture as much multi-scale contextual global semantic information as possible, as described next.
(2) Global information perception network
Referring to fig. 7, the global information perception network comprises a dilated convolution unit and a self-attention unit.
① Dilated convolution unit
The dilated convolution unit comprises five sequentially connected dilated convolution layers with different dilation rates; it extracts feature regions of different spatial ranges from the first feature map output by the encoder. Specifically, the unit adopts 3 × 3 dilated convolution layers with dilation rates {1, 2, 3, 4, 8} to perceive feature regions of different spatial ranges and effectively integrate the contextual information of road features.
② Self-attention unit
The self-attention unit comprises three parallel 1 × 1 convolution layers. It compresses the channel dimension of the first feature map output by the encoder to capture the long-distance relations between feature points; the feature regions of different spatial ranges and the long-distance relations between feature points are then added to obtain the second feature map. Specifically, the self-attention unit applies three 1 × 1 convolutions to compress the channel dimension of the input features and capture the long-distance dependencies between feature points, then performs matrix point multiplication between the resulting features to correlate feature points across different spatial and channel dimensions and increase the model's attention to global feature information. In summary, the global information perception network generates road features with denser semantic information through dilated convolution and self-attention, i.e. the second feature map.
(3) Decoder
The decoder comprises a plurality of sequentially connected decoding units. Each decoding unit is skip-connected to the corresponding coding unit and comprises a decoding block and a feature fusion module connected in sequence; the feature fusion module is connected to the corresponding spatial attention mechanism layer in the encoder.
① Decoding block
The structure of the decoding block is shown in fig. 8; it restores the spatial resolution of the road features. The decoding block comprises a second convolutional layer of size 1 × 1, a transposed convolutional layer of size 3 × 3 and a third convolutional layer of size 1 × 1 connected in sequence. The second convolutional layer reduces the channel dimension of the input features to reduce the amount of computation; the transposed convolutional layer enlarges the width and height of the reduced features (for example, doubling both simultaneously) to restore spatial detail; and the third convolutional layer maps the enlarged features to the required number of output channels to obtain the decoding features, keeping the number of feature channels and the amount of computation small. The decoding block input of the first decoding unit in the decoder is the second feature map.
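A sketch of the decoding block, assuming the LinkNet-style reduction of the channel dimension to one quarter before upsampling; the reduction factor is an assumption.

```python
import torch.nn as nn

class DecodingBlock(nn.Module):
    """1x1 reduce -> 3x3 transposed conv (doubles H and W) -> 1x1 to out channels."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        mid = in_ch // 4                      # assumed reduction factor
        self.reduce = nn.Sequential(nn.Conv2d(in_ch, mid, 1), nn.ReLU(inplace=True))
        self.up = nn.Sequential(
            # stride-2 transposed convolution doubles width and height
            nn.ConvTranspose2d(mid, mid, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(inplace=True),
        )
        self.expand = nn.Conv2d(mid, out_ch, 1)

    def forward(self, x):
        return self.expand(self.up(self.reduce(x)))
```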
Although end-to-end CNNs exploit feature information of different levels in a complementary way through skip connections and thereby improve road extraction accuracy, channel-stacking feature fusion does not carefully weigh the contribution of each feature: it ignores the semantic difference between shallow and deep features and fails to exploit the complementary information between them, which may hinder the propagation of beneficial features. This embodiment therefore designs the feature fusion module shown in fig. 9, which re-evaluates the importance of each feature, bridges the semantic gap between shallow and deep features, and maximizes the fusion and utilization of beneficial features.
② Feature fusion module
Referring to fig. 9, the feature fusion module comprises a fusion layer and a one-dimensional convolutional layer of width 5 connected in sequence. The fusion layer adds the decoding features output by the decoding block to the first feature map output by the encoder to obtain the fused features; the one-dimensional convolutional layer learns the complementary information between the shallow and deep features from the fused features to obtain a channel attention map. The channel attention map is matrix point-multiplied with the first feature map and, separately, with the decoding features, and the two resulting feature maps are added to obtain the fused output features. The fused output features of the feature fusion module of the last decoding unit in the decoder are the road information of the input remote sensing image. Specifically:
the characteristic fusion module adds the characteristics to the shallow layer characteristics x belonging to R C×H×W And the deep feature y ∈ R C×H×W And performing primary feature fusion to obtain all road feature information of the two. Then compressing in spatial dimension, autonomously learning complementary information between shallow layer and depth features by using one-dimensional convolution with width of 5, and obtaining a channel attention diagram s ∈ R by applying a Sigmoid activation function C×1×1 So that the road features are highlighted and the background features are suppressed. And finally, matrix dot multiplication is carried out on the learned weight parameters and the shallow layer characteristics and the deep layer characteristics respectively, the semantic difference between the shallow layer characteristics and the deep layer characteristics is eliminated, and effective fusion of characteristics of different levels is completed and the calculated amount is reduced through a characteristic addition mode again. The calculation process of the feature fusion module is as follows:
$$s = \sigma\!\left(\mathrm{Conv}_{1d}\!\left(P(x+y)\right)\right), \qquad z = f_m(s,x) + f_m(s,y)$$
where σ and Conv_{1d} denote the activation function and the one-dimensional convolution respectively, P denotes the compression in the spatial dimension, f_m is the matrix dot product, and z is the fused output. Although the feature map a = x + y ∈ R^{C×H×W} obtained by the first feature addition contains the spatial information of the shallow features and the semantic information of the deep features, not all features are beneficial for road information extraction. The module therefore lets the model selectively learn the important road features through the compression, convolution and activation steps, suppresses unnecessary background features, and generates the highly condensed channel attention map s, which measures the contribution of each feature point. Multiplying s with the shallow and deep features respectively bridges the semantic gap between the two, so that the useful information can be exploited by the subsequent decoder.
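The fusion computation can be sketched as follows; the spatial compression P(·) is assumed to be global average pooling, which the text implies but does not name.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Channel attention over the sum of shallow and deep features (fig. 9, sketch)."""
    def __init__(self, kernel_size: int = 5):
        super().__init__()
        # one-dimensional convolution of width 5 across the channel dimension
        self.conv1d = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, shallow, deep):            # both (B, C, H, W)
        a = shallow + deep                       # first feature addition
        w = a.mean(dim=(2, 3)).unsqueeze(1)      # spatial compression -> (B, 1, C)
        s = torch.sigmoid(self.conv1d(w))        # channel attention map
        s = s.transpose(1, 2).unsqueeze(-1)      # reshape to (B, C, 1, 1)
        return s * shallow + s * deep            # weighted fusion of the two levels
```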
Although the SGF-Net model of this embodiment obtains road features with richer spatial detail and more refined semantic information and improves the road extraction effect, the ResNet-34 in the encoder involves a large amount of computation and many network parameters, which is unfavorable for large-scale, fast road extraction from remote sensing images. This embodiment therefore compresses the SGF-Net network with the multi-level knowledge distillation learning strategy shown in fig. 10, obtaining a compact road extraction model with fewer parameters that can acquire road information quickly and accurately. The multi-level knowledge distillation learning strategy transfers the feature learning ability and class identification ability of the teacher network SGF-Net (whose encoder is ResNet-34) to the student network SGF-Net18 (whose encoder is ResNet-18), thereby reducing network parameters and computation and improving extraction speed while retaining extraction accuracy close to the teacher network SGF-Net.
Based on this, in step 203, compressing the initial road extraction model with a knowledge distillation strategy to obtain the road extraction model specifically comprises:
(1) Taking the initial road extraction model as the teacher network.
(2) Building the student network on the basis of the teacher network.
(3) Inputting the remote sensing training images into the teacher network and the student network respectively, and training the student network with the help of the teacher network, using class knowledge distillation and feature knowledge distillation, with the goal of minimizing the loss value under the knowledge distillation mechanism, to obtain the trained student network. The loss value under the knowledge distillation mechanism comprises the class knowledge distillation loss, the feature knowledge distillation loss and the model training loss.
Referring to fig. 10, the class knowledge distillation and feature knowledge distillation of the multi-level knowledge distillation strategy constructed in this embodiment are described below.
Class knowledge distillation, the conventional item of distillation learning, transfers the class information recognized by SGF-Net to SGF-Net18 so that the road information extracted by the two networks is as consistent as possible. The class knowledge distillation loss is calculated as:
$$L_{KL} = \frac{1}{N}\sum_{i=1}^{N} f_{KL}\!\left(T_i \,\middle\|\, S_i\right)$$
where N denotes the number of pixels; f_{KL}(·) denotes the Kullback-Leibler (KL) divergence; T_i denotes the teacher network's class identification result for the i-th pixel; and S_i denotes the student network's class identification result for the i-th pixel.
Class knowledge distillation uses the road extraction results of SGF-Net as "soft labels" so that SGF-Net18 can learn more class knowledge. However, class knowledge distillation focuses only on the class knowledge carried by the deep features and ignores the rich spatial detail knowledge in the shallow features. Feature knowledge distillation is therefore used so that the encoder of SGF-Net18 can imitate the shallow feature learning process of SGF-Net, forcing the former to attend better to the spatial detail of roads. Fig. 10 shows the feature learning process of the two networks on coding blocks of different levels; using the normalized feature difference below as the measure, the feature knowledge distillation loss is calculated as:
$$L_{F} = \sum_{a=1}^{4}\sum_{j}\left\| \frac{T^{a,j}}{\left\|T^{a,j}\right\|_2} - \frac{S^{a,j}}{\left\|S^{a,j}\right\|_2} \right\|_2^2$$
where a ∈ {1, 2, 3, 4} is the index of a coding block in the encoder (i.e. of the shallow features of different levels); j is the channel index of the feature map; T^{a,j} denotes the feature map of the j-th channel of the shallow features output by the a-th coding block of the teacher network; S^{a,j} denotes the feature map of the j-th channel of the shallow features output by the a-th coding block of the student network; and ‖·‖_2 denotes L2 normalization.
In summary, the multi-level knowledge distillation strategy designed in this embodiment lets SGF-Net18 learn the road information acquisition ability of SGF-Net while improving its own shallow feature extraction ability.
In this embodiment, a cross-entropy loss function is used to train the student model; the model training loss is calculated as:
$$L_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log p_i + (1-y_i)\log(1-p_i) \,\right]$$
where y_i denotes the true class of the i-th pixel and p_i denotes the student network's class prediction probability for the i-th pixel.
The student model is therefore trained with the multi-level knowledge distillation learning strategy, and the loss value under the knowledge distillation mechanism (the total loss) is calculated as:
$$L_{total} = L_{CE} + 0.1\,L_{KL} + 0.05\,L_{F}$$
where L_{total} denotes the loss value under the knowledge distillation mechanism, L_{KL} the class knowledge distillation loss, L_{F} the feature knowledge distillation loss and L_{CE} the model training loss. In this distillation learning mode, the SGF-Net18 model optimizes its network parameters by comparison with the real labels on the one hand, and improves its generalization ability with the help of SGF-Net on the other.
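A minimal sketch of the three loss terms and the total loss; the function names are illustrative, and the softmax temperature, the reduction (averaging) conventions and the exact per-channel normalization are assumptions not fixed by the formulas above.

```python
import torch
import torch.nn.functional as F

def class_kd_loss(t_logits, s_logits):
    """L_KL: KL divergence between teacher and student per-pixel class distributions."""
    t = F.softmax(t_logits, dim=1)
    s = F.log_softmax(s_logits, dim=1)
    return F.kl_div(s, t, reduction="batchmean")  # averaging convention assumed

def feature_kd_loss(t_feats, s_feats):
    """L_F: difference between per-channel L2-normalized shallow feature maps."""
    loss = 0.0
    for t, s in zip(t_feats, s_feats):           # four coding blocks, a = 1..4
        t_n = F.normalize(t.flatten(2), dim=2)   # normalize each channel map over H*W
        s_n = F.normalize(s.flatten(2), dim=2)
        # sum of squared differences over channels and pixels, averaged over the batch
        loss = loss + (t_n - s_n).pow(2).sum(dim=(1, 2)).mean()
    return loss

def total_loss(s_logits, t_logits, s_feats, t_feats, labels):
    """L_total = L_CE + 0.1 * L_KL + 0.05 * L_F (teacher outputs detached)."""
    l_ce = F.cross_entropy(s_logits, labels)
    l_kl = class_kd_loss(t_logits.detach(), s_logits)
    l_f = feature_kd_loss([f.detach() for f in t_feats], s_feats)
    return l_ce + 0.1 * l_kl + 0.05 * l_f
```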
(4) Determining the trained student network SGF-Net18 as the road extraction model.
In one example, referring to fig. 4, the SGF-Net model further comprises an initial block and an output block: the initial block precedes the encoder, and the output block follows the decoder.
The initial block comprises a fourth convolutional layer of size 1 × 1, a batch normalization layer and a ReLU activation function connected in sequence. The output block comprises a fifth convolutional layer of size 3 × 3, a ReLU activation function, a sixth convolutional layer of size 3 × 3 and a Softmax activation function connected in sequence.
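Sketches of the initial and output blocks as just described; the channel counts are assumptions chosen to match the ResNet-34 stages.

```python
import torch.nn as nn

def initial_block(out_ch: int = 64) -> nn.Sequential:
    """1x1 convolution + batch normalization + ReLU (precedes the encoder)."""
    return nn.Sequential(
        nn.Conv2d(3, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

def output_block(in_ch: int = 64, num_classes: int = 2) -> nn.Sequential:
    """3x3 conv + ReLU + 3x3 conv + Softmax (follows the decoder)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, num_classes, 3, padding=1),
        nn.Softmax(dim=1),   # per-pixel class probabilities for the binary road map
    )
```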
On the basis of an end-to-end convolutional neural network (CNN), this embodiment provides an initial road extraction model (the trained SGF-Net model) integrating spatial attention, global information perception and a feature fusion module to extract road features with richer spatial detail and more comprehensive semantic information, and designs a multi-level knowledge distillation learning strategy to accelerate the inference speed of road information extraction. The model uses spatial attention and the global information perception network to improve the spatial information expression of shallow features and to acquire multi-scale context information. To eliminate the semantic difference between shallow and deep features in the end-to-end CNN, a feature fusion module considering both spatial and semantic information is constructed; a one-dimensional convolution autonomously learns feature parameters and optimally assigns weights along the channel dimension, completing the effective fusion of shallow spatial information and deep semantic information. The multi-level knowledge distillation strategy is used to train the model, reducing the network parameters and computational complexity of the initial road extraction model so that road information in remote sensing images can be acquired quickly and accurately. Model training, validation and evaluation were carried out on the public DeepGlobe and Massachusetts satellite remote sensing road datasets and on a self-made Jingjin New City UAV remote sensing road dataset. The experimental results show that the road extraction model of this embodiment achieves high extraction accuracy and a good extraction effect, with good extraction ability for road information in both satellite and UAV remote sensing images. Meanwhile, the multi-level knowledge distillation learning strategy markedly improves the accuracy and generalization ability of the model and strikes a good balance between model accuracy and network parameters, giving the method broad application prospects.
A specific example is given below to verify the validity of the road extraction model in the above embodiment.
1. Road datasets and preprocessing
This example uses three road datasets to evaluate the extraction effect and accuracy of the road extraction model (the trained SGF-Net model) on UAV and satellite remote sensing images, to test the completeness and accuracy of road information extraction under high spatial resolution, complex urban scenes and background interference, and to test the road extraction performance on large remote sensing images of the student model (SGF-Net18) trained with the multi-level knowledge distillation learning strategy. Examples from the road datasets are shown in fig. 11.
Part (a) of fig. 11 is the self-made UAV remote sensing road dataset, collected in Jingjin New City, Tianjin, with a spatial resolution of 0.05 m. Because of the high spatial resolution and the limits of GPU memory, the UAV images were first cropped to 1024 × 1024 pixels and then downscaled to 256 × 256 pixels, yielding 1373 pairs of training images and 345 pairs of test images.
Part (b) of fig. 11 is the DeepGlobe road dataset, which provides 6226 pairs of satellite remote sensing images with a spatial resolution of 0.5 m and a size of 1024 × 1024 pixels. Its roads are narrow and elongated, strongly connected and heavily interfered with by background objects, making it very challenging for deep-learning-based automatic road extraction. In view of the large amount of data and the memory limits, this example screened suitable data by the following steps: 1) each pair of remote sensing images and corresponding labels was cropped to 512 × 512 pixels and scaled to 256 × 256 pixels; 2) with road pixels set to 1 and non-road pixels to 0, the sum of the label image pixel values was computed and images with a sum greater than 5000 were kept. These steps yielded 2188 training images and 1431 test images.
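A hedged sketch of the screening step just described; the library choices (NumPy/Pillow), the 0/255 label encoding and the point at which the 5000-pixel threshold is applied are assumptions.

```python
import numpy as np
from PIL import Image

def screen_tile(image_tile: Image.Image, label_tile: Image.Image,
                threshold: int = 5000):
    """Scale a 512x512 tile to 256x256 and keep it only if the label
    contains more than `threshold` road pixels (road = 1, non-road = 0)."""
    label = np.array(label_tile.resize((256, 256), Image.NEAREST))
    label = (label > 0).astype(np.uint8)      # binarise: road = 1, non-road = 0
    if int(label.sum()) <= threshold:
        return None                           # discard tiles with too little road
    return image_tile.resize((256, 256), Image.BILINEAR), label
```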
Part (c) of fig. 11 is the Massachusetts road dataset, which provides 1171 satellite remote sensing images of 1500 × 1500 pixels with a spatial resolution of 1.2 m. Because the images contain large blank areas, the images and corresponding labels were first cropped to 256 × 256 pixels, and suitable samples were retained following the same processing as for the DeepGlobe dataset, finally giving 2230 training images and 161 pairs of test images.
2. Experimental details
The model designed in this example was implemented in the Python 3.7 programming language with the PyTorch 1.7 deep learning framework, and all experiments were run on a CentOS 7 system. The model was trained for 100 epochs with a batch size of 16; parameters were iteratively optimized with an Adam optimizer at a learning rate of 0.0001, and training was accelerated with two NVIDIA RTX 2080 Ti GPUs.
3. Evaluation indices
For objective quantitative analysis of the extraction results, the road extraction accuracy of the models was comprehensively evaluated using the overall accuracy (OA), precision (P), recall (R), F1 score (F1) and Intersection over Union (IOU).
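These indices follow from the pixel-level confusion counts; a sketch assuming binary masks with road = 1 (the small epsilon guarding empty denominators is an implementation convenience, not part of the definitions).

```python
import numpy as np

def road_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-12) -> dict:
    """OA, precision, recall, F1 and IOU for binary road masks."""
    tp = np.sum((pred == 1) & (gt == 1))
    tn = np.sum((pred == 0) & (gt == 0))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    oa = (tp + tn) / (tp + tn + fp + fn + eps)
    p = tp / (tp + fp + eps)
    r = tp / (tp + fn + eps)
    f1 = 2 * p * r / (p + r + eps)
    iou = tp / (tp + fp + fn + eps)
    return {"OA": oa, "P": p, "R": r, "F1": f1, "IOU": iou}
```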
4. Discussion
4.1 Model comparison analysis
To fully examine the performance and accuracy of the road extraction model of this embodiment, it was compared with the classical image segmentation models U-Net, LinkNet, SegNet, BiSeNet and D-LinkNet. Model parameters were iteratively trained on the three road datasets with the same learning rate and optimizer, and the extraction accuracy of each model was comprehensively analyzed with the evaluation indices.
4.1.1 Visualization of road extraction results
Fig. 12 shows the extraction results of five classical convolutional neural network models on the DeepGlobe road dataset. Although the U-Net, LinkNet, D-LinkNet and SegNet models can identify part of the road information, missed extraction of roads and misidentification of background objects still occur. In the solid-line ellipses of rows 1-2, models such as U-Net suffer serious omission and misclassification because the spectral characteristics of the roads resemble the background information. By contrast, SGF-Net extracts the road information more completely and overcomes the interference of background with similar spectral features, demonstrating the effectiveness of spatial attention in highlighting road features. The solid-line ellipses in rows 1-2 also show that SGF-Net can identify road information that is not marked in the labels, indicating the strong road feature extraction ability of the proposed model. In addition, as shown by the solid-line ellipses in rows 3-4, when roads are occluded by trees the extraction results of the classical image segmentation models remain unsatisfactory, with serious adhesion, bending and breakage. SGF-Net, however, captures multi-scale global context semantic information through the global information perception and feature fusion modules, obtains rich spatial detail, and achieves a satisfactory visual road extraction result.
Fig. 13 shows the road extraction results of the models on the Massachusetts test set. Parts (a) and (b) of fig. 13 show the input images of the Massachusetts test set and the corresponding label images, and parts (c)-(g) show the road extraction results of U-Net, LinkNet, D-LinkNet, SegNet and SGF-Net. As fig. 13 shows, the SGF-Net proposed in this example achieves a better road extraction effect than the classical road extraction models. The solid-line ellipses in rows 1-3 show that, where roads are heavily occluded by trees, the results of U-Net, LinkNet, D-LinkNet and SegNet exhibit frequent omission and road breakage, with poor extraction. The SGF-Net model still performs well under such conditions, overcoming the tree occlusion and showing good road connectivity. Moreover, the dashed ellipses in rows 2-3 show that even where the labels fail to mark roads occluded by trees, the SGF-Net model can still complete the road extraction task excellently through the global information perception network, avoiding the road breakage seen in the classical models. In the complex urban road scene of the solid-line ellipse in row 4, the roads are multi-parallel and curved, and the classical models produce incomplete results and inaccurate road positioning. The SGF-Net model highlights road features and suppresses irrelevant features through the spatial attention and feature fusion modules, extracting road information well in complex scenes.
Fig. 14 shows the road extraction results of the five road extraction models on the Jingjin New City test set. Parts (a) and (b) of fig. 14 show the input images of the Jingjin New City test set and the corresponding label images, and parts (c)-(g) of fig. 14 show the road extraction results of U-Net, LinkNet, D-LinkNet, SegNet and SGF-Net. As fig. 14 shows, roads in this dataset are characterized by a large proportion of road pixels and high spatial resolution, which brings great challenges to complete extraction and accurate boundary localization. In the dashed ellipse of row 1, although the classical models can extract part of the road information, there are cases of erroneous and missed extraction compared with SGF-Net. U-Net extracts the road information fairly completely in the fourth test image, but does not identify the road edges accurately enough (see the solid-line ellipses in rows 1-3). LinkNet extracts better than U-Net in row 3 but is seriously deficient in the completeness of its results. D-LinkNet, as an improved version of LinkNet, appears better in road edge recognition and completeness, but "holes" and misidentification still appear in its results. Although the SegNet result in row 3 is slightly better than U-Net's, it shows numerous road breakages and omissions, reflecting poor extraction performance. SGF-Net, integrating the advantages of the spatial attention, global information perception and feature fusion modules, effectively identifies road edges, extracts the road information more completely and shows better road extraction performance.
4.1.2 Accuracy evaluation
Table 1 quantitatively evaluates the road extraction accuracy of the different models on the different datasets, where the Res-UNet model replaces the encoder of U-Net with ResNet-18. The table shows that the SGF-Net proposed in this example achieves the best road extraction performance compared with the traditional models. On the DeepGlobe dataset, the OA, F1 score and IOU of all models exceed 91%, 63% and 46% respectively, demonstrating that deep-learning-based road extraction can effectively extract road information. Compared with the LinkNet model, the proposed SGF-Net improves the F1 score and IOU by 6.22% and 8.13% respectively, showing that the introduction of the spatial attention, global information perception and feature fusion modules effectively improves model accuracy and yields good performance on the test set. Similar results were found on the classical Massachusetts road dataset: compared with Res-UNet, the second most accurate model, the F1 score and IOU of SGF-Net improved by 2.01% and 2.7% respectively, again showing the effectiveness of the feature fusion module. Meanwhile, the evaluation result of the classical U-Net is better than that of LinkNet, possibly because LinkNet downsamples excessively in its initial layer and loses a great deal of the spatial detail of the remote sensing image, reducing accuracy; this indirectly confirms the rationality of the SGF-Net initial layer design. Even on the more challenging Jingjin New City UAV remote sensing road dataset, with its higher spatial resolution, the model of this embodiment also achieves better accuracy: in terms of OA, the SGF-Net model reaches 97.58%, much higher than the other models, indicating the best discrimination between road and background.
TABLE 1 evaluation results of road extraction for each model under different data sets
(Table 1 is reproduced as an image in the original publication.)
The visual analysis and accuracy evaluation of the road extraction results readily show that SGF-Net extracts spatial information rich in detail together with multi-scale context semantic information through the spatial attention, global information perception and feature fusion modules, and effectively aggregates road features of different levels. The SGF-Net of this embodiment therefore achieves a good visual extraction effect and extraction accuracy on both UAV and satellite remote sensing road extraction tasks, and in particular identifies tree-occluded roads, complex roads and road areas missing from the labels well.
4.2 ablation experiments
To fully validate the rationality of each module design in SGF-Net, this example performed ablation experiments on the three data sets and analyzed them quantitatively using the F1 score and IoU evaluation indices. As can be seen from Table 2, the accuracy evaluation results of the ablation experiments show a consistent trend across the different data sets: as the modules are added in sequence, the F1 score and IoU also increase step by step, indicating that each module improves the accuracy of the model. Taking the DeepGlobe data set as an example, after spatial attention is added, the F1 score and IoU improve by 0.48% and 0.65%, respectively, over the baseline model, indicating that the module can focus on road features while suppressing irrelevant background information. The global information perception network obtains multi-scale contextual semantic information through dilated convolution and self-attention units, and brings improvements of 0.54% and 0.73% in F1 score and IoU, respectively, over the baseline model. The introduction of the feature fusion module bridges the semantic gap between shallow and deep features and makes full use of the spatial information of the shallow features and the semantic information of the deep features, improving the F1 score and IoU by 0.6% and 0.81%, respectively. The ablation experiments on the different data sets prove that the spatial attention, global information perception and feature fusion modules all improve the road extraction performance of the model.
TABLE 2 evaluation results of ablation experiments under different data sets
(Table 2 is reproduced as an image in the original publication.)
4.3 knowledge distillation learning effect
Table 3 compares the parameters of the teacher network SGF-Net and the student network SGF-Net18. As can be seen from the table, the parameter count and giga floating-point operations (GFLOPs) of SGF-Net18 are much lower than those of SGF-Net, reduced by 46.10% and 46.46%, respectively, greatly lowering the computational complexity. Tests show that the throughput of SGF-Net18 on 256 × 256 pixel images is 37.32 images per second (i.e., road information extraction can be completed for 37.32 images every second), 1.7 times that of SGF-Net, greatly improving the inference speed of model prediction.
TABLE 3 comparison of model parameters
(Table 3 is reproduced as an image in the original publication.)
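For illustration, a throughput figure of the kind quoted above (images processed per second on 256 × 256 inputs) can be estimated with a simple timing loop. This is a hedged sketch, not the benchmark code used in the original tests; the batch size of 1, the random input, and the iteration count are assumptions:

```python
import time
import torch

@torch.no_grad()
def throughput(model: torch.nn.Module, n_images: int = 200) -> float:
    """Rough images-per-second estimate for a segmentation model."""
    model.eval()
    x = torch.randn(1, 3, 256, 256)        # one 256 x 256 RGB image
    start = time.perf_counter()
    for _ in range(n_images):
        model(x)                            # forward pass only
    elapsed = time.perf_counter() - start
    return n_images / elapsed               # images processed per second
```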
The computation and time cost of SGF-Net18 are much lower than those of the teacher network SGF-Net, but at the expense of accuracy. Therefore, this example transfers the class information and the feature knowledge learning capability of SGF-Net to SGF-Net18 through a multi-level knowledge distillation learning strategy, improving the road extraction accuracy and the generalization capability. As can be seen from Table 4, with the help of the teacher network SGF-Net, SGF-Net18 achieves improvements of varying degrees in F1 score and IoU on the three test sets. This shows that the road extraction model trained with the multi-level knowledge distillation strategy achieves a good balance between accuracy and inference speed without losing excessive accuracy. Taking DeepGlobe as an example, after distillation learning, the F1 score and IoU of SGF-Net18 improve by 0.22% and 0.29%, respectively. These results show, on the one hand, that the knowledge distillation strategy can improve the road extraction accuracy of the small model and, on the other hand, that distillation learning yields a compact model that can extract road information quickly and accurately, improving time efficiency.
Table 4 evaluation of knowledge distillation learning effect under different data sets
(Table 4 is reproduced as an image in the original publication.)
4.4 Beijing Jin New City road extraction
To acquire road information quickly and accurately, this example uses the SGF-Net18 obtained in Section 4.3 through the multi-level knowledge distillation learning strategy to carry out the UAV remote sensing image road extraction task. Fig. 13 shows the UAV remote sensing image of Jingjin New City, whose grid size is 140139 × 184139; the extraction process is as follows: (1) Since a remote sensing image of this size cannot be loaded directly into computer memory, the Jingjin New City image is divided into 5 rows and 6 columns of image subsets to facilitate data reading and writing. (2) The road information of the image subsets is extracted in turn using the SGF-Net18 model. Since each image subset is still large, the model cannot extract the road information from it directly; therefore, this example generates 256 × 256 small-size images for each image subset following the processing procedure of Section 3.1, while using sliding-step prediction to avoid edge effects. (3) The road prediction results of the image subsets are mosaicked using ArcGIS 10.6 software; the final road extraction result is shown as the black solid line in fig. 15, where part (a) of fig. 15 shows the global road extraction result, and parts (b) and (c) of fig. 15 are two partial enlarged views.
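Step (2) amounts to sliding-window inference over each image subset. The following is a minimal sketch of that idea, assuming a tile size of 256 and an arbitrarily chosen stride of 192 (the original sliding step is not specified); overlapping predictions are averaged to suppress edge effects:

```python
import numpy as np

def predict_tiled(model_fn, image: np.ndarray,
                  tile: int = 256, stride: int = 192) -> np.ndarray:
    """Average overlapping tile predictions; assumes H, W >= tile."""
    h, w = image.shape[:2]
    out = np.zeros((h, w), dtype=np.float32)
    weight = np.zeros((h, w), dtype=np.float32)
    ys = list(range(0, h - tile + 1, stride))
    xs = list(range(0, w - tile + 1, stride))
    # ensure the last row/column of tiles reaches the image border
    if ys[-1] != h - tile:
        ys.append(h - tile)
    if xs[-1] != w - tile:
        xs.append(w - tile)
    for y in ys:
        for x in xs:
            prob = model_fn(image[y:y + tile, x:x + tile])  # (tile, tile) road probability
            out[y:y + tile, x:x + tile] += prob
            weight[y:y + tile, x:x + tile] += 1.0
    return out / weight  # every pixel is covered by at least one tile
```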
From the road extraction result in fig. 15, it can be seen that the SGF-Net18 trained with the multi-level knowledge distillation learning strategy extracts most of the road information accurately, and shows particularly strong recognition capability in areas where the linear characteristics of roads are obvious and along trunk roads. From the partial enlarged views, SGF-Net18 not only acquires the road information of urban residential areas completely, but also identifies the roads and edge information of trunk road areas accurately, successfully completing the road extraction task based on the UAV remote sensing image. It is worth noting that, according to the throughput of SGF-Net18 in Section 4.3, the theoretical time for extracting the whole image area is about 0.18 h, whereas about 5 h was actually consumed. The main reason is that steps such as image tiling, sequential reading of image data, sliding-step prediction and result storage add considerable overhead. Overall, SGF-Net18 is capable of fast extraction and accurate recognition, but performing the Jingjin New City road extraction task with it still incurs a large time cost.
From the extraction result, it can also be seen that the SGF-Net18 based on the multi-level knowledge distillation learning strategy misrecognizes some roads, which may be attributed to two main causes: (1) The Jingjin New City road training set contains only part of the image data; with such limited sample data, the CNN cannot accurately learn the relevant road features, so the model may fall into an under-fitted state. (2) SGF-Net18 uses ResNet18 as its encoder; although the distillation mechanism enhances its feature learning capability, the small number of convolution layers still limits that capability, so the road extraction accuracy is not high. Therefore, in future work, on the one hand, the generalization performance of the model can be improved by increasing the amount of training samples; on the other hand, a convolutional neural network with stronger performance can be used as the encoder to improve the feature extraction capability of the model.
Therefore, the method for extracting the remote sensing image road has the following advantages:
This embodiment provides an initial extraction model (SGF-Net) and constructs a multi-level knowledge distillation learning strategy, thereby obtaining the road extraction model. Experiments on two satellite remote sensing image road data sets and one UAV remote sensing image road data set show that: (1) SGF-Net is a high-accuracy road extraction model that can accurately extract road information under obvious linear characteristics, building shadows and tree occlusion, and completes the road extraction tasks for UAV and satellite remote sensing images well. Visual analysis, accuracy comparison and ablation experiments show that the spatial attention and global information perception networks increase the attention paid to road features and perceive feature information over a wider range; meanwhile, the feature fusion module adaptively bridges the semantic gap between shallow and deep features, enriches the spatial and semantic information of the road features, and yields a significant performance gain. (2) The multi-level knowledge distillation learning strategy reduces the network parameters and computational complexity of SGF-Net, alleviates the drop in road extraction speed caused by an excessive number of convolution layers, and improves the road extraction accuracy and generalization capability, making the method well suited to large-scale remote sensing image road extraction tasks and providing a useful reference for extracting other ground object information. Future work will focus on lightweight model research and on constructing a UAV remote sensing image road data set, so as to obtain road information quickly and accurately and improve the generalization capability of the model.
The invention also provides a remote sensing image road extraction system, which comprises:
the data acquisition module is used for acquiring a remote sensing image to be interpreted;
the road extraction module is used for inputting the remote sensing image to be interpreted into a road extraction model and extracting road information of the remote sensing image to be interpreted;
the road extraction module specifically comprises: a model determination submodule;
the model determining submodule is used for determining a road extraction model; the model determination submodule specifically includes:
the model building unit is used for building an SGF-Net model; the SGF-Net model comprises an encoder, a global information perception network and a decoder which are connected in sequence; the encoder is connected with the decoder; the encoder is constructed based on a spatial attention mechanism; the encoder is used for extracting shallow features of the input remote sensing image to obtain a first feature map; the global information perception network is used for extracting deep features of the first feature map to obtain a second feature map; the decoder is used for decoding and fusing the shallow features in the first feature map and the deep features in the second feature map to obtain road information of the input remote sensing image; the shallow feature comprises road spectrum information and road geometric shape information; the deep features include road semantic information;
the model training unit is used for training the SGF-Net model by adopting a remote sensing training image and corresponding road information and determining the trained SGF-Net model as an initial extraction model;
and the model compression unit is used for compressing the initial extraction model by adopting a knowledge distillation strategy to obtain the road extraction model.
The remote sensing image road extraction system of this embodiment adopts an initial extraction model (the trained SGF-Net model) based on an end-to-end CNN that integrates spatial attention, global information perception and feature fusion modules, and designs a multi-level knowledge distillation learning strategy. The model focuses on road features in the spatial dimension and aggregates multi-scale contextual information through the spatial attention mechanism and the global information perception module; the feature fusion module accounts for the semantic differences between features of different levels and completes, along the channel dimension, an effective fusion of the spatial detail information of shallow features with the semantic information of deep features, thereby obtaining more useful road information; meanwhile, the multi-level knowledge distillation learning strategy reduces the network parameters of the initial extraction model and the time cost of road extraction, allowing the road information of large high-resolution remote sensing images to be obtained quickly. Compared with classical road extraction models, the road extraction model of this embodiment accomplishes fast, accurate and automatic extraction of road information from satellite and UAV remote sensing images, and has good prospects for application and popularization.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the scope of application may vary. In summary, the content of this specification should not be construed as limiting the invention.

Claims (10)

1. A remote sensing image road extraction method is characterized by comprising the following steps:
acquiring a remote sensing image to be interpreted;
inputting the remote sensing image to be interpreted into a road extraction model, and extracting road information of the remote sensing image to be interpreted;
the determination method of the road extraction model comprises the following steps:
constructing an SGF-Net model; the SGF-Net model comprises an encoder, a global information sensing network and a decoder which are connected in sequence; the encoder is connected with the decoder; the encoder is constructed based on a spatial attention mechanism; the encoder is used for extracting shallow features of the input remote sensing image to obtain a first feature map; the global information perception network is used for extracting deep features of the first feature map to obtain a second feature map; the decoder is used for decoding and fusing the shallow features in the first feature map and the deep features in the second feature map to obtain road information of the input remote sensing image; the shallow feature comprises road spectrum information and road geometric shape information; the deep features include road semantic information;
training the SGF-Net model by adopting a remote sensing training image and corresponding road information, and determining the trained SGF-Net model as an initial extraction model;
and compressing the initial extraction model by adopting a knowledge distillation strategy to obtain the road extraction model.
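For illustration only (outside the claim language), a minimal PyTorch skeleton of this encoder → global information perception → decoder pipeline might look as follows; the three sub-modules are placeholders standing in for the components detailed in claims 2-4:

```python
import torch
import torch.nn as nn

class SGFNetSkeleton(nn.Module):
    """Hedged sketch of the claim-1 pipeline, not the patented implementation."""
    def __init__(self, encoder: nn.Module, gip: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder    # spatial-attention encoder (claim 2)
        self.gip = gip            # global information perception network (claim 3)
        self.decoder = decoder    # decoder with feature fusion (claim 4)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        first_map = self.encoder(image)        # shallow features: spectrum, geometry
        second_map = self.gip(first_map)       # deep features: road semantics
        return self.decoder(second_map, first_map)  # decode and fuse -> road map
```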
2. The method for extracting a remote sensing image road according to claim 1, wherein the encoder specifically comprises: a plurality of sequentially connected encoding units;
the coding unit comprises a coding block and a space attention mechanism layer which are connected in sequence;
the coding block is constructed based on a ResNet-34 network; the coding block is used for coding the input characteristics to obtain coding characteristics;
the spatial attention mechanism layer comprises a pooling layer and a first convolution layer with a size of 7 × 7 which are connected in sequence; the pooling layer is used for respectively carrying out maximum pooling and average pooling on the input coding features to obtain the distribution information of the road features in the spatial dimension; the first convolution layer is used for determining the spatial distribution relation of the road features according to the distribution information and assigning a weight to each feature point to obtain a spatial attention feature map; the spatial attention feature map is point-multiplied with the coding features, and the obtained product is added to the coding features to obtain the coding output features;
the encoding block input characteristic of a first encoding unit in the encoder is an input remote sensing image; the encoding output characteristic of the last encoding unit of the encoder is the first characteristic diagram.
3. The method for extracting a remote sensing image road according to claim 2, wherein the global information-aware network specifically comprises:
a dilated convolution unit and a self-attention unit;
the dilated convolution unit comprises five dilated convolution layers which have different dilation rates and are connected in sequence; the dilated convolution unit is used for extracting feature regions of different spatial ranges from the first feature map output by the encoder;
the self-attention unit comprises three convolution layers connected in parallel; the self-attention unit is used for compressing the channel dimension of the first feature map output by the encoder to obtain the long-distance relations between feature points; and the feature regions of the different spatial ranges and the long-distance relations between the feature points are added to obtain the second feature map.
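A hedged sketch of this global information perception network is given below; the dilation rates (1, 2, 4, 8, 16) and the channel reduction factor in the three parallel convolutions are assumptions for illustration:

```python
import torch
import torch.nn as nn

class GlobalInfoPerception(nn.Module):
    """Five stacked dilated convs plus a simple self-attention branch."""
    def __init__(self, ch: int, rates=(1, 2, 4, 8, 16)):
        super().__init__()
        self.dilated = nn.Sequential(*[
            nn.Conv2d(ch, ch, 3, padding=r, dilation=r) for r in rates
        ])
        self.q = nn.Conv2d(ch, ch // 8, 1)   # three parallel convolutions
        self.k = nn.Conv2d(ch, ch // 8, 1)   # (channel compression is assumed)
        self.v = nn.Conv2d(ch, ch, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        multi_scale = self.dilated(x)                  # feature regions of several ranges
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)       # (B, HW, C/8)
        k = self.k(x).flatten(2)                       # (B, C/8, HW)
        v = self.v(x).flatten(2)                       # (B, C, HW)
        attn = torch.softmax(q @ k, dim=-1)            # long-distance relations
        sa = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return multi_scale + sa                        # add the two branches
```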
4. The method for extracting a remote sensing image road according to claim 3, wherein the decoder specifically comprises:
a plurality of sequentially connected decoding units;
the decoding unit is in jump connection with the coding unit; the decoding unit comprises a decoding block and a feature fusion module which are connected in sequence; the feature fusion module is connected with a corresponding spatial attention mechanism layer in the encoder;
the decoding block comprises a second convolution layer with a size of 1 × 1, a transposed convolution layer with a size of 3 × 3 and a third convolution layer with a size of 1 × 1 which are connected in sequence; the second convolution layer is used for reducing the dimension of the input features; the transposed convolution layer is used for expanding the width and the height of the dimension-reduced features; the third convolution layer is used for raising the dimension of the expanded features to obtain decoding features; wherein the input feature of the decoding block of the first decoding unit in the decoder is the second feature map;
the feature fusion module comprises a fusion layer and a one-dimensional convolution layer with a width of 5 which are connected in sequence; the fusion layer is used for adding the decoding features output by the decoding block and the first feature map output by the encoder to obtain fusion features; the one-dimensional convolution layer is used for learning the complementary information between the shallow features and the deep features from the fusion features to obtain a channel attention map; the channel attention map is point-multiplied with the first feature map, the channel attention map is point-multiplied with the decoding features, and the two feature maps obtained from the point multiplications are added to obtain the decoding output features; the decoding output feature of the feature fusion module of the last decoding unit in the decoder is the road information of the input remote sensing image.
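For illustration, the feature fusion module can be sketched in the style of an ECA-like channel attention; the global pooling and the sigmoid are assumptions, as the claim names only the fusion (addition) layer and the width-5 one-dimensional convolution:

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Add shallow and deep features, derive a channel attention map with a
    width-5 1-D convolution, re-weight both inputs and sum them."""
    def __init__(self):
        super().__init__()
        self.conv1d = nn.Conv1d(1, 1, kernel_size=5, padding=2)

    def forward(self, shallow: torch.Tensor, deep: torch.Tensor) -> torch.Tensor:
        fused = shallow + deep                            # fusion layer: addition
        w = fused.mean(dim=(2, 3))                        # (B, C) channel descriptor (assumed)
        w = self.conv1d(w.unsqueeze(1)).squeeze(1)        # 1-D conv across channels
        w = torch.sigmoid(w).unsqueeze(-1).unsqueeze(-1)  # channel attention map
        return w * shallow + w * deep                     # weight both inputs, then add
```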
5. The method for extracting a remote sensing image road according to claim 1, wherein the compressing the initial extraction model by adopting a knowledge distillation strategy to obtain the road extraction model specifically comprises:
taking the initial extraction model as a teacher network;
establishing a student network by adopting the teacher network;
respectively inputting the remote sensing training images into the teacher network and the student network, and, based on class knowledge distillation and feature knowledge distillation, training the student network with the help of the teacher network with the goal of minimizing the loss value under the knowledge distillation mechanism, to obtain a trained student network; the loss value under the knowledge distillation mechanism comprises the loss value of the class knowledge distillation, the loss value of the feature knowledge distillation and the model training loss value;
and determining the trained student network as the road extraction model.
6. The method for extracting a remote sensing image road according to claim 5, wherein a calculation formula of a loss value under a knowledge distillation mechanism is as follows:
L_total = L_CE + 0.1·L_KL + 0.05·L_F
wherein L_total represents the loss value under the knowledge distillation mechanism; L_KL represents the loss value of the class knowledge distillation; L_F represents the loss value of the feature knowledge distillation; L_CE represents the model training loss value.
7. The method for extracting a remote sensing image road as claimed in claim 6, wherein the calculation formula of the loss value of class knowledge distillation is as follows:
L_KL = (1/N) Σ_{i=1}^{N} f_KL(T_i, S_i)
wherein N represents the number of pixels; f_KL(·) represents the Kullback–Leibler divergence calculation; T_i represents the class identification result of the teacher network for the i-th pixel; S_i represents the class identification result of the student network for the i-th pixel;
the calculation formula of the loss value of the feature knowledge distillation is as follows:
L_F = Σ_a Σ_j ‖ T^{a,j}/‖T^{a,j}‖₂ − S^{a,j}/‖S^{a,j}‖₂ ‖²
wherein a represents the index of a coding block in the encoder; j represents the channel index of the feature map; T^{a,j} represents the feature map of the j-th channel in the shallow features output by the a-th coding block of the teacher network; S^{a,j} represents the feature map of the j-th channel in the shallow features output by the a-th coding block of the student network; ‖·‖₂ represents the L2 norm used for normalization;
the calculation formula of the model training loss value is as follows:
L_CE = −(1/N) Σ_{i=1}^{N} [ y_i·log(p_i) + (1−y_i)·log(1−p_i) ]
wherein y_i represents the true class of the i-th pixel; p_i represents the predicted probability of the road class for the i-th pixel output by the student network.
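Putting claims 6 and 7 together, a hedged sketch of the total distillation loss might read as follows; the tensor layouts (logits of shape B × 2 × H × W, integer label maps, one feature tensor per coding block) are assumptions, and no temperature scaling is applied because the text does not mention one:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_feats, teacher_feats, labels):
    # L_CE: model training loss (cross-entropy against ground-truth labels)
    l_ce = F.cross_entropy(student_logits, labels)
    # L_KL: class knowledge distillation — per-pixel KL divergence,
    # summed over the class dimension and averaged over the N pixels
    kl = F.kl_div(F.log_softmax(student_logits, dim=1),
                  F.softmax(teacher_logits, dim=1),
                  reduction="none").sum(dim=1)
    l_kl = kl.mean()
    # L_F: feature knowledge distillation over L2-normalised feature maps,
    # one (student, teacher) tensor pair per coding block a
    l_f = 0.0
    for s, t in zip(student_feats, teacher_feats):
        s = F.normalize(s.flatten(2), p=2, dim=2)   # per-channel L2 normalisation
        t = F.normalize(t.flatten(2), p=2, dim=2)
        l_f = l_f + ((s - t) ** 2).sum()
    return l_ce + 0.1 * l_kl + 0.05 * l_f           # L_total of claim 6
```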
8. The method for extracting a remote sensing image road according to claim 1, wherein the SGF-Net model further comprises: an initial block; the initial block is connected before the encoder;
the initial block includes: a fourth convolution layer of size 1 × 1, a batch normalization layer, and a ReLU activation function connected in sequence.
9. The method for extracting a remote sensing image road according to claim 1, wherein the SGF-Net model further comprises: an output block; the decoder is followed by the output block;
the output block includes: a fifth convolution layer of size 3 × 3, a ReLU activation function, a sixth convolution layer of size 3 × 3, and a Softmax activation function connected in this order.
10. A remote sensing image road extraction system, characterized by comprising:
the data acquisition module is used for acquiring a remote sensing image to be interpreted;
the road extraction module is used for inputting the remote sensing image to be interpreted into a road extraction model and extracting road information of the remote sensing image to be interpreted;
the road extraction module specifically comprises: a model determination submodule;
the model determining submodule is used for determining a road extraction model; the model determination submodule specifically includes:
the model building unit is used for building an SGF-Net model; the SGF-Net model comprises an encoder, a global information perception network and a decoder which are connected in sequence; the encoder is connected with the decoder; the encoder is constructed based on a spatial attention mechanism; the encoder is used for extracting shallow features of the input remote sensing image to obtain a first feature map; the global information perception network is used for extracting deep features of the first feature map to obtain a second feature map; the decoder is used for decoding and fusing the shallow features in the first feature map and the deep features in the second feature map to obtain road information of the input remote sensing image; the shallow feature comprises road spectrum information and road geometric shape information; the deep features include road semantic information;
the model training unit is used for training the SGF-Net model by adopting a remote sensing training image and corresponding road information and determining the trained SGF-Net model as an initial extraction model;
and the model compression unit is used for compressing the initial extraction model by adopting a knowledge distillation strategy to obtain the road extraction model.
CN202210623810.7A 2022-06-02 2022-06-02 Remote sensing image road extraction method and system Active CN114821342B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210623810.7A CN114821342B (en) 2022-06-02 2022-06-02 Remote sensing image road extraction method and system

Publications (2)

Publication Number Publication Date
CN114821342A true CN114821342A (en) 2022-07-29
CN114821342B CN114821342B (en) 2023-04-18

Family

ID=82519692

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210623810.7A Active CN114821342B (en) 2022-06-02 2022-06-02 Remote sensing image road extraction method and system

Country Status (1)

Country Link
CN (1) CN114821342B (en)


Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009211150A (en) * 2008-02-29 2009-09-17 Hitachi Software Eng Co Ltd Extraction device for road network
CN112418027A (en) * 2020-11-11 2021-02-26 青岛科技大学 Remote sensing image road extraction method for improving U-Net network
CN112991354A (en) * 2021-03-11 2021-06-18 东北大学 High-resolution remote sensing image semantic segmentation method based on deep learning
US20210248761A1 (en) * 2020-02-10 2021-08-12 Hong Kong Applied Science and Technology Research Institute Company Limited Method for image segmentation using cnn
CN113569788A (en) * 2021-08-05 2021-10-29 中国科学院地理科学与资源研究所 Building semantic segmentation network model training method, system and application method
CN113569724A (en) * 2021-07-27 2021-10-29 中国科学院地理科学与资源研究所 Road extraction method and system based on attention mechanism and dilation convolution
CN114187520A (en) * 2021-12-15 2022-03-15 中国科学院地理科学与资源研究所 Building extraction model and application method thereof
WO2022073452A1 (en) * 2020-10-07 2022-04-14 武汉大学 Hyperspectral remote sensing image classification method based on self-attention context network


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116883679A (en) * 2023-07-04 2023-10-13 中国科学院地理科学与资源研究所 Ground object target extraction method and device based on deep learning
CN116883679B (en) * 2023-07-04 2024-01-12 中国科学院地理科学与资源研究所 Ground object target extraction method and device based on deep learning
CN116704376A (en) * 2023-08-04 2023-09-05 航天宏图信息技术股份有限公司 nDSM extraction method and device based on single satellite image and electronic equipment
CN116704376B (en) * 2023-08-04 2023-10-20 航天宏图信息技术股份有限公司 nDSM extraction method and device based on single satellite image and electronic equipment
CN117078943A (en) * 2023-10-17 2023-11-17 太原理工大学 Remote sensing image road segmentation method integrating multi-scale features and double-attention mechanism
CN117078943B (en) * 2023-10-17 2023-12-19 太原理工大学 Remote sensing image road segmentation method integrating multi-scale features and double-attention mechanism
CN117789042A (en) * 2024-02-28 2024-03-29 中国地质大学(武汉) Road information interpretation method, system and storage medium
CN117789042B (en) * 2024-02-28 2024-05-14 中国地质大学(武汉) Road information interpretation method, system and storage medium

Also Published As

Publication number Publication date
CN114821342B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
CN114821342B (en) Remote sensing image road extraction method and system
CN112668494A (en) Small sample change detection method based on multi-scale feature extraction
CN111401361B (en) End-to-end lightweight depth license plate recognition method
CN113780149B (en) Remote sensing image building target efficient extraction method based on attention mechanism
CN110853057B (en) Aerial image segmentation method based on global and multi-scale full-convolution network
CN111898439B (en) Deep learning-based traffic scene joint target detection and semantic segmentation method
CN113642390B (en) Street view image semantic segmentation method based on local attention network
CN105069468A (en) Hyper-spectral image classification method based on ridgelet and depth convolution network
CN114187520B (en) Building extraction model construction and application method
CN114255403A (en) Optical remote sensing image data processing method and system based on deep learning
CN113066089B (en) Real-time image semantic segmentation method based on attention guide mechanism
CN114694039A (en) Remote sensing hyperspectral and laser radar image fusion classification method and device
CN116912257B (en) Concrete pavement crack identification method based on deep learning and storage medium
CN116206112A (en) Remote sensing image semantic segmentation method based on multi-scale feature fusion and SAM
CN114821340A (en) Land utilization classification method and system
CN117496347A (en) Remote sensing image building extraction method, device and medium
CN116912708A (en) Remote sensing image building extraction method based on deep learning
CN115713529A (en) Light-weight optical remote sensing image change detection method based on efficient attention
CN116778318A (en) Convolutional neural network remote sensing image road extraction model and method
CN114973136A (en) Scene image recognition method under extreme conditions
CN116524189A (en) High-resolution remote sensing image semantic segmentation method based on coding and decoding indexing edge characterization
CN117217368A (en) Training method, device, equipment, medium and program product of prediction model
CN112818818B (en) Novel ultra-high-definition remote sensing image change detection method based on AFFPN
CN113096070A (en) Image segmentation method based on MA-Unet
CN117173595A (en) Unmanned aerial vehicle aerial image target detection method based on improved YOLOv7

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant