CN110490081B - Remote sensing object interpretation method based on focusing weight matrix and variable-scale semantic segmentation neural network - Google Patents



Publication number
CN110490081B
CN110490081B (application CN201910660740.0A)
Authority
CN
China
Prior art keywords
scale
remote sensing
image
network
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910660740.0A
Other languages
Chinese (zh)
Other versions
CN110490081A (en)
Inventor
崔巍
何新
姚勐
王梓溦
郝元洁
穆力玮
马力
陈先锋
史燕娟
胡颖
申雪皎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Technology WUT filed Critical Wuhan University of Technology WUT
Priority to CN201910660740.0A priority Critical patent/CN110490081B/en
Publication of CN110490081A publication Critical patent/CN110490081A/en
Application granted granted Critical
Publication of CN110490081B publication Critical patent/CN110490081B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/13Satellite images

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Astronomy & Astrophysics (AREA)
  • Remote Sensing (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a remote sensing object interpretation method based on a focusing weight matrix and a variable-scale semantic segmentation neural network, which comprises the following steps: data acquisition and preprocessing; making a thematic map; cutting samples; designing a multi-spatial-scale remote sensing image annotation strategy; making labels for the sample set; constructing a multi-scale remote sensing image semantic interpretation model; selecting a training set and a validation set; setting training parameters; training the model; and designing a remote sensing object recognition algorithm based on a focusing weight matrix and verifying the effect of the variable-scale remote sensing image semantic interpretation model. By constructing an LSTM, the invention transfers the learned association between nouns in the semantic description and the object mask maps obtained by semantic segmentation to the spatial relations between those mask maps, thereby realizing variable-scale semantic segmentation and end-to-end recognition of the spatial relations of remote sensing objects, and advancing image classification and recognition in remote sensing applications to a higher level.

Description

Remote sensing object interpretation method based on focusing weight matrix and variable-scale semantic segmentation neural network
Technical Field
The invention relates to the technical field of image processing, in particular to a remote sensing object interpretation method based on a focusing weight matrix and a variable-scale semantic segmentation neural network.
Background
Remote sensing image classification and remote sensing object identification are current research hotspots in remote sensing technology. With the development of artificial intelligence, deep neural networks have been widely applied to high-resolution remote sensing image analysis and have increasingly become an effective processing method.
At present, the conventional Attention-based LSTM model is mainly applied to the semantic description of ordinary digital images. In the course of implementing the present invention, the inventors found that the prior-art method has at least the following technical problems:
Uncertainty of spatial position: at each time step, the focus area mechanism generates an image feature matrix of size 14 × 14, corresponding to 196 spatial positions in the remote sensing image; these positions often deviate from the actual objects, which limits the application of the focus area mechanism in remote sensing object identification.
Boundary uncertainty: nouns (labels of objects) in the semantic description cannot accurately segment the boundaries of remotely sensed objects in the image and therefore cannot identify spatial relationships between objects.
Uncertainty of spatial scale: the contextual information surrounding an object is complex and variable, so a single-scale model has difficulty identifying remote sensing objects; sometimes larger-scale semantic information allows the remote sensing object to be identified more accurately.
Therefore, the method in the prior art has the technical problem of inaccurate identification.
Disclosure of Invention
In view of the above, the invention provides a remote sensing object interpretation method based on a focus weight matrix and a variable-scale semantic segmentation neural network, which is used for solving or at least partially solving the technical problem of inaccurate identification in the method in the prior art.
The invention provides a remote sensing object interpretation method based on a focus weight matrix and a variable-scale semantic segmentation neural network, which comprises the following steps:
step S1: acquiring a high-resolution remote sensing image of a preset research area, and preprocessing the acquired high-resolution remote sensing image;
step S2: vectorizing by using professional GIS software to obtain a thematic map layer of a research area, and rasterizing the vector thematic map to obtain a corresponding grid gray map;
step S3: cutting the preprocessed remote sensing image and the raster gray map, and extracting data sample sets at two spatial scales, wherein one set comprises the original images paired with large-scale GT images and the other comprises the original images paired with small-scale GT images;
step S4: carrying out content annotation on each remote sensing image in the two sets of spatial scale data sample sets according to a multi-spatial scale remote sensing image annotation strategy to obtain sample set annotations;
step S5: constructing a variable-scale remote sensing image semantic interpretation model, obtaining multi-scale semantic segmentation images through the interpretation model, extracting masks of objects at the two scales through a mask extraction algorithm, and associating the small-scale mask objects segmented by the U-Net network with nouns in the semantic description through variable-scale object identification, wherein the variable-scale remote sensing image semantic interpretation model comprises: an FCN fully convolutional network, a U-Net semantic segmentation network, and an LSTM network based on the Attention mechanism, the FCN network being used for large-scale object segmentation, the U-Net network for small-scale object segmentation, and the LSTM for generating semantic descriptions containing objects at the two spatial scales and their spatial relations;
step S6: training the FCN (fully convolutional) network, the U-Net semantic segmentation network and the LSTM (long short-term memory) network in the constructed variable-scale remote sensing image semantic interpretation model to obtain a trained model;
step S7: recognizing the remote sensing object with the trained model, specifically: locating the focusing weight matrix of the noun generated by the LSTM network at the current time step onto the corresponding small-scale object in the mask image obtained by the U-Net semantic segmentation, and completing the identification of the object if the object class label is the same as the noun.
In one embodiment, when the object class label differs from the noun, the method further includes starting a multi-scale remote sensing object rectification algorithm, specifically: first, the large-scale mask object obtained by FCN semantic segmentation that covers the current region of interest is located by a scale-up step, and then a small-scale object whose class label matches the noun is located within the candidate large-scale object by a scale-down step, thereby completing the identification of the object.
In one embodiment, the method further comprises: and performing effect verification analysis on the multi-scale remote sensing image semantic interpretation model.
In one embodiment, the multi-spatial-scale remote sensing image annotation strategy in step S4 is: each descriptive statement is composed of small-scale remote sensing objects and their spatial relations, with the large-scale object left implicit.
In one embodiment, step S6 specifically includes:
step S6.1: dividing a training set and a verification set from a data sample set according to a preset proportion;
step S6.2: respectively setting training parameters of an FCN network, a U-Net network and an LSTM network;
step S6.3: adding the original image and the large-scale GT image as input data into an FCN, performing iterative training on the FCN, and storing a corresponding result and an optimal model weight obtained after the training is completed;
step S6.4: adding the original image and the small-scale GT image as input data into a U-Net network, performing iterative training on the U-Net network, and storing a corresponding result and an optimal model weight obtained after the training is finished;
step S6.5: LSTM network training: and adding features extracted from the original image by VGG-19 and multi-scale semantic labels as input data into the LSTM network, performing iterative training on the LSTM network, and storing corresponding results and optimal model weights obtained after training.
One or more technical solutions in the embodiments of the present application have at least one or more of the following technical effects:
the invention provides a remote sensing object interpretation method based on a focusing weight matrix and a variable-scale semantic segmentation neural network, which comprises the following steps of firstly, obtaining a high-resolution remote sensing image of a preset research area, and preprocessing the high-resolution remote sensing image; then, making a thematic map layer of the research area, and rasterizing the vector thematic map to obtain a corresponding grid gray map; then, cutting the preprocessed remote sensing image and the grid gray level image, and extracting two sets of data sample sets with spatial scales; then, carrying out content annotation on each remote sensing image in the two sets of spatial scale data sample sets according to a multi-spatial scale remote sensing image annotation strategy to obtain sample set annotations; then constructing a variable-scale remote sensing image semantic interpretation model, obtaining a multi-scale semantic segmentation image through the interpretation model, extracting masks of two scale objects through a mask extraction algorithm, and associating small-scale mask objects segmented by the U-Net network with nouns in semantic description through variable-scale object identification; training an FCN (fuzzy C-means) network, a U-Net semantic segmentation network and an LSTM (least Square TM) network in the constructed variable-scale remote sensing image semantic interpretation model to obtain a trained model; and finally, recognizing the remote sensing object by using the trained model and adopting a remote sensing object recognition algorithm based on a focusing weight matrix.
Compared with the prior art, the invention constructs a variable-scale semantic interpretation model for remote sensing images based on FCN, U-Net and LSTM networks, which can generate remote sensing image descriptions at multiple spatial scales while segmenting objects in the image and identifying their spatial relations end to end. First, the remote sensing image is fed into the FCN and U-Net networks respectively for semantic segmentation at two spatial scales, so that each pixel of the original image carries semantic labels at two scales, forming a hierarchical relation of multi-scale remote sensing objects. Second, the features extracted from the same image by the pre-trained VGG-19 are fed into the LSTM network, which outputs semantic descriptions of the remote sensing objects and their spatial relations at the two scales. Finally, the relation between nouns in the semantic description and the object mask maps is established through the focusing weight matrix, improving the accuracy of object identification.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a schematic flow chart of a method for interpreting a remote sensing object based on a focusing weight matrix and a variable-scale semantic segmentation neural network according to the present invention;
FIG. 2 is a schematic diagram of variable scale object segmentation and image semantic annotation according to the present invention;
FIG. 3 is a network model structure diagram of the remote sensing object interpretation method based on the focus weight matrix and the variable scale semantic segmentation neural network.
Detailed Description
The invention aims to solve the technical problem that the identification is inaccurate due to the fact that the spatial relationship of a remote sensing object cannot be accurately identified by the method in the prior art, and provides a method for constructing the link between a noun in semantic description obtained by LSTM and an object mask image obtained by semantic segmentation and transferring the spatial relationship in the semantic description to the object mask images, so that the semantic segmentation of the remote sensing object and the end-to-end identification of the spatial relationship are realized.
In order to achieve the above purpose, the main concept of the invention is as follows:
by designing a remote sensing image variable-scale semantic interpretation model based on FCN, U-Net and LSTM networks, the remote sensing image description with multiple spatial scales can be generated, simultaneously, objects in the image are segmented, and the spatial relationship is identified end to end. Firstly, respectively inputting a remote sensing image into an FCN and a U-Net network to carry out semantic segmentation of two spatial scales, so that each pixel of an original image has a semantic label of two scales, and a hierarchical relation of multi-scale remote sensing objects can be formed; secondly, inputting the features extracted from the same image after the pre-trained VGG-19 into an LSTM network, and outputting semantic descriptions of the remote sensing objects and the spatial relationship thereof in two scales; and finally, establishing the relationship between the nouns and the object mask graph in the semantic description through a focusing weight matrix.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment provides a method for identifying a remote sensing object based on a focus weight matrix and a variable-scale semantic segmentation neural network, please refer to fig. 1, and the method comprises the following steps:
step S1: and acquiring a high-resolution remote sensing image of a preset research area, and preprocessing the acquired high-resolution remote sensing image.
Specifically, the acquired remote sensing image data is preprocessed, including geometric correction, atmospheric correction, clipping processing and the like. The preset study area can be selected according to the needs and the actual situation. In the embodiment, a Quickbird remote sensing image with the resolution of 60cm in a certain area of a certain city is obtained.
Step S2: and vectorizing by using professional GIS software to obtain a thematic map layer of the research area, and rasterizing the vector thematic map to obtain a corresponding grid gray map.
In particular, the professional GIS software may be ArcGIS software or other processing software.
Step S3: and cutting the preprocessed remote sensing image and the grid gray image, and extracting two sets of data sample sets with spatial scales, wherein the two sets of data sample sets with the spatial scales respectively comprise an original image, a large-scale GT image, the original image and a small-scale GT image.
In a specific implementation, a suitable cutting scale is selected, the remote sensing image and the raster gray map of the study area are cut with an ArcGIS script, and each cut sample is named by its ID plus the image-format suffix. Through step S3, two sets of spatial-scale data sets are extracted: one contains the original images and the large-scale GT images, the other the original images and the small-scale GT images.
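The cutting step can be sketched with a few lines of NumPy (a minimal illustration only: the 210 × 210 tile size, the synthetic arrays, and the function name `tile_pairs` are assumptions, since the original workflow uses an ArcGIS script and names samples by ID):

```python
import numpy as np

def tile_pairs(image, gt, tile=210):
    """Cut an image and its ground-truth raster into aligned square tiles.

    Tiles that would extend past the border are discarded. Returns a list of
    (sample_id, image_tile, gt_tile) triples; the integer id stands in for
    the ID-based file name used in the original workflow.
    """
    h, w = image.shape[:2]
    samples, sid = [], 0
    for y in range(0, h - tile + 1, tile):
        for x in range(0, w - tile + 1, tile):
            samples.append((sid, image[y:y + tile, x:x + tile],
                            gt[y:y + tile, x:x + tile]))
            sid += 1
    return samples

# A synthetic 3-band "scene" and its single-band raster gray map.
img = np.zeros((420, 630, 3), dtype=np.uint8)
gt = np.zeros((420, 630), dtype=np.uint8)
samples = tile_pairs(img, gt)
print(len(samples))  # 2 rows x 3 columns = 6 aligned sample pairs
```

Because the image and gray map are cut with identical offsets, every sample pair stays pixel-aligned, which the per-pixel GT labels require.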
Step S4: and carrying out content annotation on each remote sensing image in the two sets of spatial scale data sample sets according to a multi-spatial scale remote sensing image annotation strategy to obtain sample set annotations.
Specifically, step S4 produces the ground-truth labels (GT) of the sample set: multi-scale semantic labels are made for each image in the sample set according to the semantic annotation strategy, and the annotation results are written into an Excel table, in which the first column of each row is the image file name and the following columns contain the corresponding multi-scale annotation statements.
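The layout of the annotation table can be illustrated as follows (a hypothetical sketch: the file names and sentences are invented, and CSV is used in place of Excel to keep the example dependency-free):

```python
import csv
import io

# One row per sample: first column the image file name, the remaining
# column the multi-scale annotation statement, as described above.
rows = [
    ["0001.png", "a building with a road on the left"],
    ["0002.png", "a pond with a tree on the right"],
]
buf = io.StringIO()
csv.writer(buf).writerows(rows)
table = buf.getvalue()
print(table.splitlines()[0])  # 0001.png,a building with a road on the left
```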
In one embodiment, the spatial scale remote sensing image labeling strategy in step S4 is: each descriptive statement is composed of a small-scale remote sensing object and a spatial relation thereof, and a large-scale object is hidden.
Specifically, the multi-spatial scale remote sensing image labeling strategy of the invention is as follows:
(1) Each description is composed of small-scale remote sensing objects and their spatial relations, and the small-scale remote sensing objects imply a large-scale object. That is, the image contains several large-scale objects, each large-scale object contains several small-scale objects, and spatial relations exist between objects of the same scale. The annotation strategy is to describe the scale and spatial-relation information contained in the image as completely as possible, as shown in FIG. 2, in which O_i, O_j represent large-scale objects and O_i1, O_i2, O_j1, O_j2, …, O_jn represent small-scale objects.
(2) In small scale labeling, one object is usually selected as the primary object to which the other objects are attached through spatial relationships. In this way, homogeneous small-scale objects do not appear repeatedly in one large object.
(3) If there are two or more large-scale objects, the corresponding sub-descriptions (small-scale objects and their spatial relations, e.g. O_i1 R_i12 O_i2 …) are joined with a connecting word.
Step S5: constructing a variable-scale remote sensing image semantic interpretation model, obtaining a multi-scale semantic segmentation image through the interpretation model, extracting masks of two scale objects through a mask extraction algorithm, and associating a small-scale mask object segmented by a U-Net network with a noun in semantic description through variable-scale object identification, wherein the variable-scale remote sensing image semantic interpretation model comprises the following steps: the system comprises an FCN full convolution network, a U-Net semantic segmentation network and an LSTM network based on an Attention mechanism, wherein the FCN network is used for large-scale object segmentation, the U-Net network is used for small-scale object segmentation, and the LSTM is used for generating semantic description containing two space scale objects and space relation thereof.
Specifically, the FCN and U-Net semantic segmentation networks and the Attention-based LSTM network can each be constructed in TensorFlow. During training, the FCN network takes the original images and the large-scale GT images as input, the U-Net network takes the original images and the small-scale GT images, and the LSTM network takes the original images and the multi-scale semantic annotation GT, i.e., the image semantic annotations made in S4, including the manual annotation statement of each image. In this way, at the verification stage the model yields a large-scale semantic segmentation image, a small-scale semantic segmentation image and a multi-scale semantic description. Object masks are extracted from the segmentation images through a mask algorithm, and finally the small-scale mask objects segmented by the U-Net network are associated with nouns in the semantic description through the variable-scale remote sensing object recognition algorithm, so that the spatial relations between remote sensing objects are obtained from the semantic description. The specific model structure is shown in FIG. 3.
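The mask-extraction step can be sketched as follows (a minimal NumPy sketch under the assumption that a mask image holds the object's class index value on its pixels and 0 elsewhere; the function name is invented, and splitting a class into individual connected instances is omitted):

```python
import numpy as np

def extract_class_masks(seg):
    """Split a semantic segmentation label map into per-class mask images.

    For each class index c present in the label map (0 = background), build
    a mask whose pixels equal c where the class occurs and 0 elsewhere --
    the format later intersected with the focusing weight matrix.
    """
    masks = {}
    for c in np.unique(seg):
        if c == 0:
            continue
        m = np.zeros_like(seg)
        m[seg == c] = c
        masks[int(c)] = m
    return masks

seg = np.array([[0, 1, 1],
                [2, 2, 0],
                [2, 0, 1]])
masks = extract_class_masks(seg)
print(sorted(masks))  # [1, 2]
```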
Step S6: and training an FCN (fuzzy C-means) network, a U-Net semantic segmentation network and an LSTM (least Square TM) network in the constructed variable-scale remote sensing image semantic interpretation model to obtain a trained model.
Wherein, step S6 specifically includes:
step S6.1: dividing a training set and a verification set from a data sample set according to a preset proportion;
step S6.2: respectively setting training parameters of an FCN network, a U-Net network and an LSTM network;
step S6.3: adding the original image and the large-scale GT image as input data into an FCN, performing iterative training on the FCN, and storing a corresponding result and an optimal model weight obtained after the training is completed;
step S6.4: adding the original image and the small-scale GT image as input data into a U-Net network, performing iterative training on the U-Net network, and storing a corresponding result and an optimal model weight obtained after the training is finished;
step S6.5: LSTM network training: and adding features extracted from the original image by VGG-19 and multi-scale semantic labels as input data into the LSTM network, performing iterative training on the LSTM network, and storing corresponding results and optimal model weights obtained after training.
In a specific implementation, step S6.1 randomly divides the 1835 study samples into a training set and a validation set in a certain proportion, for example 1167 training samples and 668 validation samples.
Step S6.2 sets the training parameters: for the FCN network, the learning rate is 1e-5, the batch_size is 1 and the number of iterations is 60000; for the U-Net network, the learning rate is 1e-4, the batch_size is 20, the number of iterations is 120, and the Dropout parameter is set to 0.7 to prevent overfitting; for the LSTM network, image features are extracted with a VGG-19 pre-trained model (feature map size 14 × 14 × 512), the number of hidden-layer neurons is set to 1024, the word-embedding dimension is 512, the learning rate is 0.001, the batch_size is 20, and the number of iterations is 120.
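Gathered in one place, the reported hyper-parameters look as follows (the dictionary layout and key names are invented for illustration; the values are those stated above):

```python
# Training hyper-parameters of the three sub-networks, as reported above.
params = {
    "FCN":   {"lr": 1e-5, "batch_size": 1,  "iterations": 60000},
    "U-Net": {"lr": 1e-4, "batch_size": 20, "iterations": 120, "dropout": 0.7},
    "LSTM":  {"lr": 1e-3, "batch_size": 20, "iterations": 120,
              "hidden_units": 1024, "embedding_dim": 512},
}
print(params["U-Net"]["dropout"])  # 0.7
```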
In step S6.3, the FCN segmentation precision reaches 0.89 through analysis, and in step S6.4, the U-Net segmentation precision reaches 0.93.
In step S6.5, the evaluation index values are shown in table 1 by analysis:
TABLE 1. LSTM evaluation indexes

                    Bleu_1  Bleu_2  Bleu_3  Bleu_4  METEOR  ROUGE_L  CIDEr
The present method  0.893   0.744   0.655   0.587   0.455   0.779    5.044
In Table 1, BLEU is a common machine-translation evaluation criterion based on n-gram precision, with n usually taken from 1 to 4. ROUGE_L is computed from recall and is an evaluation criterion for automatic summarization. METEOR, used to evaluate machine translation, aligns the words of the model's output with a reference translation and computes precision, recall and F-measure over exact word matches, stem matches, synonym matches and other cases. The CIDEr metric treats each sentence as a "document", represents it as a tf-idf vector, and scores the cosine similarity between the reference caption and the model-generated caption.
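For the BLEU columns, the core computation at n = 1 is clipped unigram precision, which can be sketched as follows (single reference, brevity penalty omitted; the sentences are invented examples, not taken from the data set):

```python
from collections import Counter

def bleu1(candidate, reference):
    """Clipped unigram precision: BLEU-1 with one reference and no brevity
    penalty. Each candidate word is credited at most as many times as it
    appears in the reference."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    clipped = sum(min(n, ref[w]) for w, n in cand.items())
    return clipped / max(sum(cand.values()), 1)

ref = "a road with a building on the left"
print(round(bleu1("a road with a building", ref), 3))  # 1.0
print(round(bleu1("a a a", ref), 3))                   # 0.667
```

The second call shows the clipping: "a" occurs twice in the reference, so only 2 of the 3 candidate occurrences are credited.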
Step S7: the method for recognizing the remote sensing object by using the trained model specifically comprises the following steps: and positioning a focusing weight matrix of the noun generated by the LSTM network at the current moment to a corresponding small-scale object in a mask image obtained by the U-Net semantic segmentation, and finishing the identification of the object if the object class label is the same as the noun. Where the focus weight matrix is generated by the LSTM network at each time instant when a word is generated, which represents the region of interest (focus position) in the image for the currently generated word.
Specifically, by designing a remote sensing object recognition algorithm based on a focus weight matrix, the trained model can be used for recognizing the remote sensing object.
The identification of the remote sensing object is based on the focusing weight matrix generated by the LSTM network and the mask objects extracted, via the mask algorithm, from the semantic segmentation map produced by the U-Net network. First, in the present embodiment, the 14 × 14 weight matrix (i.e., the focusing weight matrix) is resampled to 210 × 210. Let α_ij denote the weight of the focusing weight matrix at position (i, j), and m_ij the pixel value at position (i, j) of a mask object image obtained from the U-Net segmentation; within each mask object image, the pixels covered by the object take the object's class index value C, and all remaining positions are 0.
The intersection region of the region-of-interest weight matrix and the object mask map is computed with a first formula, in which C is a normalization factor; the average weight value over the intersection region is then computed with a second formula. (Both formulas appear only as equation images in the original and are not reproduced here.)
and if the class label of the remote sensing object is the same as the noun generated at the moment t, the position and the boundary of the remote sensing object can be identified through an object mask diagram.
Generally speaking, the invention designs a variable-scale semantic interpretation model for remote sensing images based on FCN, U-Net and LSTM networks, which can generate remote sensing image descriptions at multiple spatial scales while segmenting objects in the image and identifying their spatial relations end to end. First, the remote sensing image is fed into the FCN and U-Net networks respectively for semantic segmentation at two spatial scales, so that each pixel of the original image carries semantic labels at two scales, forming a hierarchical relation of multi-scale remote sensing objects. Second, the features extracted from the same image by the pre-trained VGG-19 are fed into the LSTM network, which outputs semantic descriptions of the remote sensing objects and their spatial relations at the two scales. Finally, the relation between nouns in the semantic description and the object mask maps is established through the focusing weight matrix.
In order to further improve identification accuracy, in one embodiment, when the object class label differs from the noun, the method further includes starting a multi-scale remote sensing object correction algorithm, specifically: first, the large-scale mask object obtained by FCN semantic segmentation that covers the current region of interest is located by a scale-up step, and then a small-scale object whose class label matches the noun is located within the candidate large-scale object by a scale-down step, thereby completing the identification of the object.
Specifically, the class label of the remote sensing object obtained directly by the object-identification method of step S7 often differs from the noun generated at time t; to address this, the invention further provides the multi-scale remote sensing object correction algorithm.
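Under the same simplified mask representation, the scale-up/scale-down correction can be sketched as follows (illustrative only: locating the large-scale object via the attention peak and matching by class label are assumed readings of the algorithm, whose exact criteria are not given in the text):

```python
import numpy as np

def correct(att, large_mask, small_mask, small_names, noun):
    """Scale-up: find the large-scale (FCN) object covering the attention
    peak; scale-down: search inside it for a small-scale (U-Net) object
    whose class label equals the generated noun. Returns that object's
    class index, or None if no match exists."""
    y, x = np.unravel_index(np.argmax(att), att.shape)
    big = large_mask[y, x]                     # scale-up step
    if big == 0:                               # peak falls on background
        return None
    inside = large_mask == big
    for c in np.unique(small_mask[inside]):    # scale-down step
        if c != 0 and small_names.get(int(c)) == noun:
            return int(c)
    return None

att = np.zeros((6, 6))
att[2, 2] = 1.0                                # attention peak at (2, 2)
large = np.zeros((6, 6), dtype=int)
large[:4, :4] = 7                              # one large-scale object (id 7)
small = np.zeros((6, 6), dtype=int)
small[0, 0], small[3, 3] = 1, 2                # two small-scale objects inside
print(correct(att, large, small, {1: "tree", 2: "road"}, "road"))  # 2
```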
In one embodiment, the method further comprises: and performing effect verification analysis on the multi-scale remote sensing image semantic interpretation model.
In a specific implementation, the effect verification analysis of the multi-scale remote sensing image semantic interpretation model comprises: analyzing and verifying the model's remote sensing object identification and correction results on the validation-set samples, so as to test the identification and correction effect. The analysis shows that among the 668 validation samples, 300 ground-truth (GT) sentences contain "with", and 256 of the corresponding generated description sentences contain "with", a rate of about 85%, indicating that the multi-scale semantic annotation strategy is feasible.
This embodiment analyzes the reliability of the description sentences generated for the 668 validation samples; the results are shown in Tables 2 and 3:
Table 2: Reliability analysis of the generated description sentences (table content reproduced as an image in the original)

Table 3: Noun matching before and after correction (table content reproduced as an image in the original)
With the correction algorithm provided by the invention, the noun matching rate rises from 41.87% to 83.64%, an improvement of nearly 42 percentage points; this experimental result shows that the correction algorithm is sound and feasible.
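Note that the reported gain is a difference in percentage points, not a relative increase; a one-line check using the figures quoted above:

```python
before, after = 41.87, 83.64        # noun matching rates (%) before/after correction
gain = after - before               # absolute improvement in percentage points
print(f"{gain:.2f} percentage points")  # prints "41.77 percentage points"
```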
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made in the embodiments of the present invention without departing from the spirit or scope of the embodiments of the invention. Thus, if such modifications and variations of the embodiments of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to encompass such modifications and variations.

Claims (5)

1. A remote sensing object interpretation method based on a focus weight matrix and a variable-scale semantic segmentation neural network is characterized by comprising the following steps:
step S1: acquiring a high-resolution remote sensing image of a predetermined study area, and preprocessing the acquired high-resolution remote sensing image;
step S2: vectorizing with professional GIS software to obtain a thematic map layer of the study area, and rasterizing the vector thematic map to obtain the corresponding raster grayscale image;
step S3: cropping the preprocessed remote sensing image and the raster grayscale image, and extracting data sample sets at two spatial scales, which respectively comprise the original images paired with large-scale GT images and the original images paired with small-scale GT images;
step S4: annotating the content of each remote sensing image in the two spatial-scale data sample sets according to a multi-spatial-scale remote sensing image annotation strategy, to obtain the sample-set annotations;
step S5: constructing a variable-scale remote sensing image semantic interpretation model, obtaining multi-scale semantic segmentation images through the interpretation model, extracting the masks of the objects at the two scales through a mask extraction algorithm, and associating the small-scale mask objects segmented by the U-Net network with the nouns in the semantic description through variable-scale object identification; the variable-scale remote sensing image semantic interpretation model comprises an FCN fully convolutional network, a U-Net semantic segmentation network, and an LSTM network based on the attention mechanism, wherein the FCN network is used for large-scale object segmentation, the U-Net network for small-scale object segmentation, and the LSTM for generating semantic descriptions containing the objects at the two spatial scales and their spatial relations;
step S6: training the FCN fully convolutional network, the U-Net semantic segmentation network, and the LSTM network in the constructed variable-scale remote sensing image semantic interpretation model, to obtain the trained model;
step S7: recognizing the remote sensing object with the trained model, specifically: locating, through the focusing weight matrix of the noun generated by the LSTM network at the current moment, the corresponding small-scale object in the mask image obtained by U-Net semantic segmentation; if the object class label is the same as the noun, the identification of the object is completed; the focusing weight matrix is generated by the LSTM network each time it generates a word, and represents the region of interest in the image of the currently generated word; locating the corresponding small-scale object in the U-Net mask image through the focusing weight matrix of the noun generated at the current moment comprises:
obtaining the pixel values at positions (i, j) of the mask object images obtained after U-Net segmentation, wherein in each mask object the pixel value at the positions covered by the object is the class index value C of that object, and the remaining positions are 0;
obtaining the intersection region of the focusing weight matrix and each object mask image;
calculating the average weight value over each intersection region, and selecting the remote sensing object with the maximum average weight value, i.e., locating the corresponding small-scale object in the mask image obtained by U-Net semantic segmentation.
2. The method of claim 1, wherein when the object class label is not the same as the noun, the method further comprises invoking a multi-scale remote sensing object correction algorithm, specifically: a scale-up step first locates the large-scale mask object, obtained by FCN semantic segmentation, that covers the current region of interest; a scale-down step then searches within that candidate large-scale object for a small-scale object whose class label matches the noun, thereby completing the identification of the object.
3. The method of claim 1, wherein the method further comprises: and performing effect verification analysis on the multi-scale remote sensing image semantic interpretation model.
4. The method of claim 1, wherein the multi-spatial-scale remote sensing image annotation strategy in step S4 is: each description sentence is composed of small-scale remote sensing objects and their spatial relations, with the large-scale objects left implicit.
5. The method according to claim 1, wherein step S6 specifically comprises:
step S6.1: dividing a training set and a verification set from a data sample set according to a preset proportion;
step S6.2: respectively setting training parameters of an FCN network, a U-Net network and an LSTM network;
step S6.3: feeding the original images and the large-scale GT images as input data into the FCN network, iteratively training the FCN network, and saving the corresponding results and the optimal model weights obtained after training;
step S6.4: feeding the original images and the small-scale GT images as input data into the U-Net network, iteratively training the U-Net network, and saving the corresponding results and the optimal model weights obtained after training;
step S6.5: LSTM network training: feeding the features extracted from the original images by VGG-19 together with the multi-scale semantic labels as input data into the LSTM network, iteratively training the LSTM network, and saving the corresponding results and the optimal model weights obtained after training.
CN201910660740.0A 2019-07-22 2019-07-22 Remote sensing object interpretation method based on focusing weight matrix and variable-scale semantic segmentation neural network Active CN110490081B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910660740.0A CN110490081B (en) 2019-07-22 2019-07-22 Remote sensing object interpretation method based on focusing weight matrix and variable-scale semantic segmentation neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910660740.0A CN110490081B (en) 2019-07-22 2019-07-22 Remote sensing object interpretation method based on focusing weight matrix and variable-scale semantic segmentation neural network

Publications (2)

Publication Number Publication Date
CN110490081A CN110490081A (en) 2019-11-22
CN110490081B true CN110490081B (en) 2022-04-01

Family

ID=68547555

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910660740.0A Active CN110490081B (en) 2019-07-22 2019-07-22 Remote sensing object interpretation method based on focusing weight matrix and variable-scale semantic segmentation neural network

Country Status (1)

Country Link
CN (1) CN110490081B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021059572A1 (en) * 2019-09-27 2021-04-01 富士フイルム株式会社 Information processing device, method for operating information processing device, and program for operating information processing device
CN111666849B (en) * 2020-05-28 2022-02-01 武汉大学 Multi-source remote sensing image water body detection method based on multi-view depth network iterative evolution
CN112651314A (en) * 2020-12-17 2021-04-13 湖北经济学院 Automatic landslide disaster-bearing body identification method based on semantic gate and double-temporal LSTM
CN112906627B (en) * 2021-03-15 2022-11-15 西南大学 Green pricklyash peel identification method based on semantic segmentation
CN113362287B (en) * 2021-05-24 2022-02-01 江苏星月测绘科技股份有限公司 Man-machine cooperative remote sensing image intelligent interpretation method
CN113313180B (en) * 2021-06-04 2022-08-16 太原理工大学 Remote sensing image semantic segmentation method based on deep confrontation learning
CN113435284B (en) * 2021-06-18 2022-06-28 武汉理工大学 Post-disaster road extraction method based on dynamic filtering and multi-direction attention fusion
CN113591633B (en) * 2021-07-18 2024-04-30 武汉理工大学 Object-oriented land utilization information interpretation method based on dynamic self-attention transducer
CN113591685B (en) * 2021-07-29 2023-10-27 武汉理工大学 Geographic object spatial relationship identification method and system based on multi-scale pooling
CN114882292B (en) * 2022-05-31 2024-04-12 武汉理工大学 Remote sensing image ocean target identification method based on cross-sample attention mechanism graph neural network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101692224A (en) * 2009-07-08 2010-04-07 南京师范大学 High-resolution remote sensing image search method fused with spatial relation semantics
CN105740901A (en) * 2016-01-29 2016-07-06 武汉理工大学 Geographic ontology based variable scale object-oriented remote sensing classification correction method
CN109086770A (en) * 2018-07-25 2018-12-25 成都快眼科技有限公司 A kind of image, semantic dividing method and model based on accurate scale prediction

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010132731A1 (en) * 2009-05-14 2010-11-18 Lightner Jonathan E Inverse modeling for characteristic prediction from multi-spectral and hyper-spectral remote sensed datasets

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101692224A (en) * 2009-07-08 2010-04-07 南京师范大学 High-resolution remote sensing image search method fused with spatial relation semantics
CN105740901A (en) * 2016-01-29 2016-07-06 武汉理工大学 Geographic ontology based variable scale object-oriented remote sensing classification correction method
CN109086770A (en) * 2018-07-25 2018-12-25 成都快眼科技有限公司 A kind of image, semantic dividing method and model based on accurate scale prediction

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A fractal and entropy-based model for selecting the optimum spatial scale of soil erosion; Cui Wei; Springer; 2018-04-12; pp. 1-7 *
Image classification and recognition based on local features and weak annotation information; Wu Lu; Wanfang Data; 2018-12-15; pp. 1-114 *

Also Published As

Publication number Publication date
CN110490081A (en) 2019-11-22

Similar Documents

Publication Publication Date Title
CN110490081B (en) Remote sensing object interpretation method based on focusing weight matrix and variable-scale semantic segmentation neural network
RU2661750C1 (en) Symbols recognition with the use of artificial intelligence
RU2699687C1 (en) Detecting text fields using neural networks
RU2701995C2 (en) Automatic determination of set of categories for document classification
CA3124358C (en) Method and system for identifying citations within regulatory content
US20190294921A1 (en) Field identification in an image using artificial intelligence
RU2760471C1 (en) Methods and systems for identifying fields in a document
CN116861014B (en) Image information extraction method and device based on pre-training language model
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
CN108681735A (en) Optical character recognition method based on convolutional neural networks deep learning model
CN112069900A (en) Bill character recognition method and system based on convolutional neural network
Karunarathne et al. Recognizing ancient sinhala inscription characters using neural network technologies
CN111666937A (en) Method and system for recognizing text in image
CN112000809A (en) Incremental learning method and device for text categories and readable storage medium
CN116311310A (en) Universal form identification method and device combining semantic segmentation and sequence prediction
Van Hoai et al. Text recognition for Vietnamese identity card based on deep features network
Ahmed et al. Recognition of Urdu Handwritten Alphabet Using Convolutional Neural Network (CNN).
Sharma et al. [Retracted] Optimized CNN‐Based Recognition of District Names of Punjab State in Gurmukhi Script
CN117115565B (en) Autonomous perception-based image classification method and device and intelligent terminal
Al Ghamdi A novel approach to printed Arabic optical character recognition
CN112836709A (en) Automatic image description method based on spatial attention enhancement mechanism
CN109657710B (en) Data screening method and device, server and storage medium
Zhou et al. SRRNet: A Transformer Structure with Adaptive 2D Spatial Attention Mechanism for Cell Phone-Captured Shopping Receipt Recognition
Dadi Tifinagh-IRCAM Handwritten character recognition using Deep learning
Su et al. FPRNet: end-to-end full-page recognition model for handwritten Chinese essay

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant