CN113095136A - Unmanned aerial vehicle aerial video semantic segmentation method based on UVid-Net - Google Patents

Unmanned aerial vehicle aerial video semantic segmentation method based on UVid-Net

Info

Publication number
CN113095136A
CN113095136A
Authority
CN
China
Prior art keywords
model
semantic segmentation
data
net
unmanned aerial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110257662.7A
Other languages
Chinese (zh)
Other versions
CN113095136B (en)
Inventor
潘晓光
陈亮
董虎弟
宋晓晨
张雅娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanxi Sanyouhe Smart Information Technology Co Ltd
Original Assignee
Shanxi Sanyouhe Smart Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanxi Sanyouhe Smart Information Technology Co Ltd filed Critical Shanxi Sanyouhe Smart Information Technology Co Ltd
Priority to CN202110257662.7A
Priority claimed from CN202110257662.7A
Publication of CN113095136A
Application granted
Publication of CN113095136B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G06V20/13 Satellite images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Astronomy & Astrophysics (AREA)
  • Remote Sensing (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention belongs to the technical field of semantic segmentation and specifically relates to a UVid-Net-based semantic segmentation method for unmanned aerial vehicle (UAV) aerial video, comprising the following steps. Data acquisition: collect a data set for UAV video semantic segmentation and label its categories at the pixel level, completing construction of the data set required for model training. Data preprocessing: preprocessing includes normalization, data division, image scaling, and the like, and the data set is augmented to ensure the training effect of the model. Model identification: build a UVid-Net-based semantic segmentation model and input the training data, completing construction of the parameter model. Model saving: when the loss function of the model no longer decreases, save the model. Model evaluation: evaluate the performance of the model's segmentation results with multiple evaluation indexes. The encoding path of the invention captures the temporal dynamics of the video by extracting features from multiple frames, and the new decoding path retains the features of the encoder layers, improving semantic segmentation performance.

Description

Unmanned aerial vehicle aerial video semantic segmentation method based on UVid-Net
Technical Field
The invention belongs to the technical field of semantic segmentation, and particularly relates to a semantic segmentation method of unmanned aerial vehicle aerial video based on UVid-Net.
Background
Aerial image analysis has been used to assess damage immediately after natural disasters. Typically, aerial images are captured by different imaging modalities, such as Synthetic Aperture Radar (SAR) and hyperspectral imaging, mounted on satellites. In recent years, Unmanned Aerial Vehicles (UAVs) have also been widely used in applications such as disaster management, urban planning, wildlife tracking, and agricultural planning. Owing to rapid deployment and customizable flight paths, drone images and videos can provide finer details and complement satellite-based image analysis methods in critical applications such as disaster response. In addition, drone images may be used in conjunction with satellite images to better support city planning or geographic information updates. However, UAV image and video analysis has largely been limited to target detection and recognition tasks, such as building detection and road segmentation, and work on semantic segmentation of UAV images or videos is currently limited.
Problems or disadvantages of the prior art: semantic segmentation is the process of assigning a predetermined class label to every pixel in an image. However, integrating temporal information into the semantic segmentation process when extending it to video applications remains difficult.
Disclosure of Invention
Aiming at the above problems, a UVid-Net-based semantic segmentation method for UAV aerial video is provided. An extended version of the ManipalUAVid data set for UAV video semantic segmentation is collected, and pixel-level annotation is performed for four background classes: greenery, buildings, roads, and water bodies. After data collection is finished, the data are preprocessed, including segmentation and noise addition. The preprocessed data are input into the constructed UVid-Net network to train the network model; the model is saved once its loss function no longer decreases, which completes model construction, and the model's performance is then evaluated by several evaluation methods. The technical scheme adopted by the invention is as follows:
a semantic segmentation method of unmanned aerial vehicle aerial video based on UVid-Net comprises the following steps:
s100, data acquisition: collecting a data set for unmanned aerial vehicle video semantic segmentation, and carrying out pixel-level labeling on the category of the data set to complete construction of the data set required by model training;
s200, data preprocessing: preprocessing comprises normalization, data division, image scaling and the like, and a data set is amplified to ensure the training effect of the model;
s300, identifying a model: building a semantic segmentation model based on UVid-Net, inputting training data, and completing construction of a parameter model;
s400, model storage: when the loss function of the model is not reduced any more, the model is saved;
s500, model evaluation: and performing performance evaluation on the segmentation result of the model through various evaluation indexes.
Further, the step S100 of collecting data specifically includes: an extended version of the ManipalUAVid data set for UAV video semantic segmentation is collected; the data set comprises a plurality of videos, labels are provided for the key frames of the videos, and pixel-level annotation is performed for four background classes: greenery, buildings, roads, and water bodies.
Further, in step S200, the specific steps include: normalization: a normalization operation is applied to all data to unify their dynamic range and facilitate model training:

x' = (x - min(x)) / (max(x) - min(x))

where x' is the normalized data and x is the unprocessed data.
Data division: the data set is divided into a training set, a validation set, and a test set in a 7:2:1 ratio; the training set is used to train the model, the validation set to check whether the model loss keeps decreasing, and the test set to evaluate the model's performance;
Image scaling: because the sizes of the captured images in the original data set are not fixed, all data obtained after dataset division are scaled before being input to the model; each image is resized proportionally to 1280 x 720 pixels;
Data expansion: the two collected data sets for UAV video semantic segmentation are fused to form the new extended ManipalUAVid data set; combining the two data sets yields more video data, and each video contains more frames, so the temporal consistency of the video semantic segmentation model can be evaluated.
Further, S300 further includes model building: a UVid-Net-based network model is constructed to perform semantic segmentation of UAV aerial video. Two frames I_{t-1} and I_t are taken as input to the network model, which then performs semantic segmentation on I_t. The model is divided into an encoding module and a decoding module; in the encoding stage, features are extracted by two different structures, U-Net and ResNet-50. The U-Net encoder consists of convolutional and max-pooling layers and is used for feature extraction. The upper branch of the encoder comprises four modules; each upper-branch module consists of two consecutive 3x3 convolutional layers with batch normalization and ReLU activation, after which a 1x1 convolutional layer reduces the dimensionality of the feature maps and, finally, a max-pooling layer extracts the most salient features for the subsequent layers. After each max-pooling operation, the number of feature maps is doubled. The lower branch of the encoder also consists of four modules; each lower-branch module comprises a set of 3x3 convolutional layers with batch normalization and ReLU activation and a max-pooling layer. That max-pooling layer extracts the most salient features and, as in the upper branch, the number of feature maps doubles after each max-pooling operation.
Further, the features extracted by the upper and lower branches of the encoder are input to two separate bottleneck layers and, finally, the activations of the two branches are concatenated and provided to the decoder. The ResNet-50 feature extractor, by contrast, consists of residual blocks, which help to mitigate vanishing gradients; its architecture consists of an initial 7x7 convolution followed by a batch normalization layer and a ReLU activation function.
Further, in S400, once the loss function of the model no longer decreases, the model is saved. The loss is calculated with the cross-entropy loss function, formulated as follows:

L = - Σ_{i=1}^{N} y_i log(p_i)

where y_i represents the label of sample i, p_i represents the probability that sample i is predicted correctly, and N represents the number of categories.
Further, in S500, the performance of the model's segmentation results is evaluated by calculating the following evaluation indexes: mean Intersection over Union (mIoU), Precision, Recall, and F1-Score, formulated as follows:

mIoU = (1/N) · Σ_c TP_c / (TP_c + FP_c + FN_c)

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1-Score = 2 · Precision · Recall / (Precision + Recall)

where TP, FP, TN, and FN represent true-positive, false-positive, true-negative, and false-negative predictions, respectively.
Advantageous effects: the invention provides an enhanced encoder-decoder CNN architecture (UVid-Net) for UAV video semantic segmentation. The encoder of the proposed architecture embeds temporal information for temporally consistent labels, and the decoder is enhanced by introducing a feature-retainer module. The architecture has two parallel CNN branches for feature extraction. This new encoding path captures the temporal dynamics of the video by extracting features from multiple frames. These features are further processed by the decoder for class-label estimation, and the algorithm adopts a new decoding path that retains the features of the encoder layers and improves semantic segmentation performance.
Drawings
FIG. 1 is a system flow diagram of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The application discloses a semantic segmentation method of unmanned aerial vehicle aerial video based on UVid-Net, which comprises the following steps:
s100, data acquisition: collecting a data set for unmanned aerial vehicle video semantic segmentation, and carrying out pixel-level labeling on the category of the data set to complete construction of the data set required by model training;
s200, data preprocessing: preprocessing comprises normalization, data division, image scaling and the like, and a data set is amplified to ensure the training effect of the model;
s300, identifying a model: building a semantic segmentation model based on UVid-Net, inputting training data, and completing construction of a parameter model;
s400, model storage: when the loss function of the model is not reduced any more, the model is saved;
s500, model evaluation: and performing performance evaluation on the segmentation result of the model through various evaluation indexes.
Further, the step S100 of collecting data specifically includes: an extended version of the ManipalUAVid data set for UAV video semantic segmentation is collected; the data set comprises a plurality of videos, labels are provided for the key frames of the videos, and pixel-level annotation is performed for four background classes: greenery, buildings, roads, and water bodies.
Further, in step S200, the specific steps include: normalization: a normalization operation is applied to all data to unify their dynamic range and facilitate model training:

x' = (x - min(x)) / (max(x) - min(x))

where x' is the normalized data and x is the unprocessed data.
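The exact normalization variant is not recoverable from the original figure; the following is a minimal NumPy sketch assuming per-image min-max scaling to [0, 1], with a hypothetical function name:

```python
import numpy as np

def min_max_normalize(frame: np.ndarray) -> np.ndarray:
    """Scale raw pixel values into [0, 1]; assumes the min-max variant x' = (x - min) / (max - min)."""
    x_min, x_max = float(frame.min()), float(frame.max())
    return (frame.astype(np.float32) - x_min) / (x_max - x_min + 1e-8)  # epsilon guards against flat frames
```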
Data division: the data set is divided into a training set, a validation set, and a test set in a 7:2:1 ratio; the training set is used to train the model, the validation set to check whether the model loss keeps decreasing, and the test set to evaluate the model's performance;
Image scaling: because the sizes of the captured images in the original data set are not fixed, all data obtained after dataset division are scaled before being input to the model; each image is resized proportionally to 1280 x 720 pixels;
Data expansion: the two collected data sets for UAV video semantic segmentation are fused to form the new extended ManipalUAVid data set; combining the two data sets yields more video data, and each video contains more frames, so the temporal consistency of the video semantic segmentation model can be evaluated. A sketch of the division and scaling steps follows.
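As a hedged illustration of the 7:2:1 division and the fixed 1280 x 720 input size, the following Python sketch uses OpenCV for resizing; all helper names are hypothetical:

```python
import random
import cv2  # OpenCV, used here only for resizing

def split_dataset(pairs, ratios=(0.7, 0.2, 0.1), seed=42):
    """Shuffle (frame, mask) pairs and split them into train/val/test at 7:2:1."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_train = int(ratios[0] * len(pairs))
    n_val = int(ratios[1] * len(pairs))
    return pairs[:n_train], pairs[n_train:n_train + n_val], pairs[n_train + n_val:]

def resize_pair(frame, mask, size=(1280, 720)):
    """Resize an image and its label map to the assumed model input; cv2 takes (width, height)."""
    frame = cv2.resize(frame, size, interpolation=cv2.INTER_LINEAR)
    mask = cv2.resize(mask, size, interpolation=cv2.INTER_NEAREST)  # nearest keeps class indices discrete
    return frame, mask
```

Nearest-neighbor interpolation is used for the label maps so that resizing never produces fractional class labels.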
Further, in S300, a UVid-Net-based network model is constructed to perform semantic segmentation of UAV aerial video. Two frames I_{t-1} and I_t are taken as input to the network model, which then performs semantic segmentation on I_t. The model is divided into an encoding module and a decoding module; in the encoding stage, features are extracted by two different structures, U-Net and ResNet-50.
The U-Net encoder consists of convolutional and max-pooling layers and is used for feature extraction. The upper branch of the encoder comprises four modules. Each module consists of two consecutive 3x3 convolutional layers with batch normalization and ReLU activation. A 1x1 convolutional layer then reduces the dimensionality of the feature maps. Finally, a max-pooling layer extracts the most salient features for the subsequent layers, and the number of feature maps doubles after each max-pooling operation. The lower branch of the encoder is likewise composed of four modules. Each lower-branch module has two sets of 3x3 convolutional layers with batch normalization and ReLU activation, followed by a max-pooling layer that extracts the most salient features; as in the upper branch, the number of feature maps doubles after each max-pooling operation. The features extracted by the upper and lower branches of the encoder are input to two separate bottleneck layers; finally, the activations of the two branches are concatenated and provided to the decoder.
The ResNet-50 feature extractor consists of residual blocks, which help to mitigate vanishing gradients. Its architecture begins with an initial 7x7 convolution followed by a batch normalization layer and a ReLU activation function, after which a max-pooling operation with a 3x3 kernel is applied. After this max-pooling operation, the architecture consists of four stages. The first stage comprises three residual blocks, each containing three layers with 64, 64, and 128 filters. The second stage consists of four residual blocks of three layers each, using 128, 128, and 256 filters. The third stage comprises six residual blocks of three layers each, using 256, 256, and 512 filters. The fourth stage consists of three residual blocks of three layers each, using 512, 512, and 1024 filters. In stages 2, 3, and 4, the first residual block halves the width and height of the input with a strided operation. The first and last layers of each residual block use a 1x1 kernel and the second layer a 3x3 kernel; the resulting features are provided to the decoder.
Features are thus extracted by these two different feature extractors in the encoding stage, the two feature vectors are fused, and the fused features are provided to the decoder for semantic segmentation. The decoding module is the same as the U-Net decoder and performs the same operations, except that the UVid-Net model finally passes through a feature-retention module that combines the corresponding feature maps of the encoder and decoder; a SoftMax layer then yields the probability of each pixel belonging to each class.
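To make the dual-branch encoder concrete, the following PyTorch sketch pairs a U-Net-style branch (two 3x3 conv+BN+ReLU layers, a 1x1 conv, then max pooling, with feature maps doubling per module) with a stock torchvision ResNet-50 trunk standing in for the patent's modified residual stages. This is a sketch under assumptions, not the patented implementation: all module names, channel widths, and the fusion by bilinear upsampling plus concatenation are illustrative choices.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class UNetEncoderBlock(nn.Module):
    """One upper-branch module: two 3x3 conv+BN+ReLU layers, a 1x1 conv
    reducing feature-map dimensionality, then 2x2 max pooling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 1),   # 1x1 dimensionality reduction
            nn.MaxPool2d(2),                # keeps the most salient activations
        )

    def forward(self, x):
        return self.block(x)

class TwoBranchEncoder(nn.Module):
    """Hypothetical dual-frame encoder: the previous frame goes through a
    U-Net-style branch whose feature maps double after each pooling, the
    current frame through a ResNet-50 trunk; the two activations are fused
    by concatenation for the decoder."""
    def __init__(self, base=64):
        super().__init__()
        chans = [3, base, base * 2, base * 4, base * 8]
        self.unet_branch = nn.Sequential(
            *[UNetEncoderBlock(chans[i], chans[i + 1]) for i in range(4)]
        )
        trunk = resnet50(weights=None)
        self.resnet_branch = nn.Sequential(*list(trunk.children())[:-2])  # drop avgpool/fc
        self.proj = nn.Conv2d(2048, base * 8, 1)  # align channel counts before fusion

    def forward(self, frame_prev, frame_curr):
        f_prev = self.unet_branch(frame_prev)               # (B, 512, H/16, W/16)
        f_curr = self.proj(self.resnet_branch(frame_curr))  # (B, 512, H/32, W/32)
        f_curr = nn.functional.interpolate(
            f_curr, size=f_prev.shape[-2:], mode="bilinear", align_corners=False)
        return torch.cat([f_prev, f_curr], dim=1)           # fused features for the decoder

# Example: fused = TwoBranchEncoder()(torch.randn(1, 3, 720, 1280),
#                                     torch.randn(1, 3, 720, 1280))
```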
Further, in S400, once the loss function of the model no longer decreases, the model is saved. The loss is calculated with the cross-entropy loss function, formulated as follows:

L = - Σ_{i=1}^{N} y_i log(p_i)

where y_i represents the label of sample i, p_i represents the probability that sample i is predicted correctly, and N represents the number of categories.
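Assuming the standard per-pixel reading of this formula, a minimal PyTorch expression of the loss is:

```python
import torch.nn.functional as F

def segmentation_loss(logits, target):
    """Per-pixel multi-class cross entropy, L = -sum_i y_i * log(p_i),
    averaged over all pixels. logits: (B, num_classes, H, W) raw scores;
    target: (B, H, W) integer class indices."""
    return F.cross_entropy(logits, target)
```

F.cross_entropy applies log-softmax internally, so in this sketch the network's final SoftMax layer is not applied before the loss during training.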
Further, in S500, the performance of the model's segmentation results is evaluated by calculating the following evaluation indexes: mean Intersection over Union (mIoU), Precision, Recall, and F1-Score, formulated as follows:

mIoU = (1/N) · Σ_c TP_c / (TP_c + FP_c + FN_c)

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1-Score = 2 · Precision · Recall / (Precision + Recall)

where TP, FP, TN, and FN represent true-positive, false-positive, true-negative, and false-negative predictions, respectively.
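A NumPy sketch of these four indexes, computed per class from the TP/FP/FN counts and averaged over the classes (helper name hypothetical):

```python
import numpy as np

def evaluate_segmentation(pred, gt, num_classes=4, eps=1e-8):
    """Compute mIoU, Precision, Recall, and F1 from flattened prediction
    and ground-truth label maps."""
    pred, gt = pred.ravel(), gt.ravel()
    tp = np.array([np.sum((pred == c) & (gt == c)) for c in range(num_classes)], dtype=float)
    fp = np.array([np.sum((pred == c) & (gt != c)) for c in range(num_classes)], dtype=float)
    fn = np.array([np.sum((pred != c) & (gt == c)) for c in range(num_classes)], dtype=float)
    iou = tp / (tp + fp + fn + eps)
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return {"mIoU": iou.mean(), "Precision": precision.mean(),
            "Recall": recall.mean(), "F1-Score": f1.mean()}
```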
Although only the preferred embodiments of the present invention have been described in detail, the present invention is not limited to the above embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art, and all changes are encompassed in the scope of the present invention.

Claims (7)

1. A semantic segmentation method of unmanned aerial vehicle aerial video based on UVid-Net is characterized in that: comprises the following steps:
s100, data acquisition: collecting a data set for unmanned aerial vehicle video semantic segmentation, and carrying out pixel-level labeling on the category of the data set to complete construction of the data set required by model training;
s200, data preprocessing: preprocessing comprises normalization, data division, image scaling and the like, and a data set is amplified to ensure the training effect of the model;
s300, identifying a model: building a semantic segmentation model based on UVid-Net, inputting training data, and completing construction of a parameter model;
s400, model storage: when the loss function of the model is not reduced any more, the model is saved;
s500, model evaluation: and performing performance evaluation on the segmentation result of the model through various evaluation indexes.
2. The UVid-Net-based semantic segmentation method for unmanned aerial vehicle aerial video according to claim 1, wherein the specific steps of data collection in step S100 are as follows: an extended version of the ManipalUAVid data set for UAV video semantic segmentation is collected; the data set comprises a plurality of videos, labels are provided for the key frames of the videos, and pixel-level annotation is performed for four background classes: greenery, buildings, roads, and water bodies.
3. The UVid-Net-based semantic segmentation method for unmanned aerial vehicle aerial video according to claim 1, wherein step S200 specifically comprises: normalization: a normalization operation is applied to all data to unify their dynamic range and facilitate model training:

x' = (x - min(x)) / (max(x) - min(x))

where x' is the normalized data and x is the unprocessed data;
Data division: the data set is divided into a training set, a validation set, and a test set in a 7:2:1 ratio; the training set is used to train the model, the validation set to check whether the model loss keeps decreasing, and the test set to evaluate the model's performance;
Image scaling: because the sizes of the captured images in the original data set are not fixed, all data obtained after dataset division are scaled before being input to the model; each image is resized proportionally to 1280 x 720 pixels;
Data expansion: the two collected data sets for UAV video semantic segmentation are fused to form the new extended ManipalUAVid data set; combining the two data sets yields more video data, and each video contains more frames, so the temporal consistency of the video semantic segmentation model can be evaluated.
4. The UVid-Net-based semantic segmentation method for unmanned aerial vehicle aerial video according to claim 3, wherein S300 further comprises model construction: a UVid-Net-based network model is constructed to perform semantic segmentation of UAV aerial video; two frames I_{t-1} and I_t are taken as input to the network model, which then performs semantic segmentation on I_t; the model is divided into an encoding module and a decoding module, and in the encoding stage features are extracted by two different structures, U-Net and ResNet-50; the U-Net encoder consists of convolutional and max-pooling layers and is used for feature extraction; the upper branch of the encoder comprises four modules, each consisting of two consecutive 3x3 convolutional layers with batch normalization and ReLU activation, followed by a 1x1 convolutional layer that reduces the dimensionality of the feature maps and a max-pooling layer that extracts the most salient features for subsequent layers, the number of feature maps doubling after each max-pooling operation; the lower branch of the encoder likewise consists of four modules, each having a set of 3x3 convolutional layers with batch normalization and ReLU activation and a max-pooling layer that extracts the most salient features, the number of feature maps doubling after each max-pooling operation as in the upper branch.
5. The UVid-Net-based semantic segmentation method for unmanned aerial vehicle aerial video according to claim 4, wherein the features extracted by the upper and lower branches of the encoder are input into two separate bottleneck layers, and finally the activations of the two branches are concatenated and provided to the decoder; the ResNet-50 feature extractor consists of residual blocks, which help to mitigate vanishing gradients, and its architecture consists of an initial 7x7 convolution followed by a batch normalization layer and a ReLU activation function.
6. The UVid-Net-based semantic segmentation method for unmanned aerial vehicle aerial video according to claim 1, wherein in S400, once the loss function of the model no longer decreases, the model is saved, the loss being calculated with the cross-entropy loss function:

L = - Σ_{i=1}^{N} y_i log(p_i)

where y_i represents the label of sample i, p_i represents the probability that sample i is predicted correctly, and N represents the number of categories.
7. The UVid-Net-based semantic segmentation method for unmanned aerial vehicle aerial video according to claim 1, wherein in S500, the performance of the model's segmentation results is evaluated by calculating the following evaluation indexes: mean Intersection over Union (mIoU), Precision, Recall, and F1-Score, formulated as follows:

mIoU = (1/N) · Σ_c TP_c / (TP_c + FP_c + FN_c)

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1-Score = 2 · Precision · Recall / (Precision + Recall)

where TP, FP, TN, and FN represent true-positive, false-positive, true-negative, and false-negative predictions, respectively.
CN202110257662.7A 2021-03-09 UVid-Net-based semantic segmentation method for unmanned aerial vehicle aerial video Active CN113095136B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110257662.7A CN113095136B (en) 2021-03-09 UVid-Net-based semantic segmentation method for unmanned aerial vehicle aerial video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110257662.7A CN113095136B (en) 2021-03-09 UVid-Net-based semantic segmentation method for unmanned aerial vehicle aerial video

Publications (2)

Publication Number Publication Date
CN113095136A true CN113095136A (en) 2021-07-09
CN113095136B CN113095136B (en) 2024-07-26


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190164007A1 (en) * 2017-11-30 2019-05-30 TuSimple Human driving behavior modeling system using machine learning
US20190043203A1 (en) * 2018-01-12 2019-02-07 Intel Corporation Method and system of recurrent semantic segmentation for image processing
CN110322435A (en) * 2019-01-20 2019-10-11 北京工业大学 A kind of gastric cancer pathological image cancerous region dividing method based on deep learning
CN111062252A (en) * 2019-11-15 2020-04-24 浙江大华技术股份有限公司 Real-time dangerous article semantic segmentation method and device and storage device
CN111488884A (en) * 2020-04-28 2020-08-04 东南大学 Real-time semantic segmentation method with low calculation amount and high feature fusion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
陈璐; 管霜霜: "Research on change detection methods for urban high-resolution remote sensing images based on deep learning" (基于深度学习的城市高分遥感图像变化检测方法的研究), Application Research of Computers (计算机应用研究), no. 1, 30 June 2020 (2020-06-30) *


Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant