CN114529878B

CN114529878B - Cross-domain road scene semantic segmentation method based on semantic perception

Info

Publication number: CN114529878B
Application number: CN202210083793.2A
Authority: CN
Inventors: 雷印杰; 彭铎
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2022-01-21
Filing date: 2022-01-21
Publication date: 2023-04-25
Anticipated expiration: 2042-01-21
Also published as: CN114529878A

Abstract

The invention relates to the technical field of semantic segmentation, and discloses a semantic-perception-based cross-domain road scene semantic segmentation method, which comprises the following steps: s1, sending source domain images into a feature encoder according to batches to obtain features of a plurality of images in the batches; s2, sending the image features in the batch to a central alignment module based on semantic perception, and realizing global central alignment of source domain images with different styles in a feature space; and S3, sending the characteristics with the aligned centers into a distribution alignment module based on semantic perception, and further realizing local distribution alignment on the image characteristics of various styles in the batch. According to the semantic segmentation method for the cross-domain road scene based on semantic perception, the labeling cost is reduced, the source domain image which is the most basic of the training network can be directly obtained from the game, the corresponding label is directly generated through the engine, manual labeling is not needed, and a large amount of manpower and material resources are saved.

Description

Cross-domain road scene semantic segmentation method based on semantic perception

Technical Field

The invention relates to the technical field of semantic segmentation, in particular to a semantic-perception-based cross-domain road scene semantic segmentation method.

Background

The image semantic segmentation refers to that a computer realizes deep understanding of an image according to semantic content of the image and then performs pixel-level visual segmentation. In recent years, with the continuous development of artificial intelligence, semantic segmentation technology based on a deep neural network is increasingly applied to various aspects such as industrial production, social security, transportation and the like, wherein the realization of unmanned semantic segmentation is a popular research field, and has a good development prospect. The semantic segmentation is a core algorithm of unmanned vehicle driving, when the vehicle-mounted camera probes the image, the image is input into the neural network, and the background operation equipment can automatically segment and classify the image so as to realize obstacle avoidance of pedestrians and vehicles.

In recent years, the performance of deep learning semantic segmentation has tended to be perfect, but researchers have found that models that perform well on training datasets are not ideal when applied in other scenarios because the training image (source domain) and the applied image (target domain) exist in two different domains, with inconsistent data distributions. In this regard, researchers have proposed a large number of artificial intelligence methods based on deep learning to cope with the problem of efficiency attenuation of semantic segmentation after cross-domain, and generally these methods need to acquire images of target domains in advance to further adapt the source domain distribution to the target domain distribution, so as to improve the segmentation robustness in the target domain, but there are two problems: first, most of the current methods need to acquire a part of image data of a new target area in advance to adapt to the new target area, which is clearly very labor-and material-consuming, and for example, an operator cannot acquire road images of all target areas in advance. Secondly, these methods aim at performing cross-domain segmentation on a certain known target domain, and a fixed model can only be applied to the specific domain, cannot be generalized to other domains, and cannot meet the needs of practical applications, so that a method which does not depend on target domain data and has considerable generalization performance is very required to be provided. In order to solve the problems, the invention processes data in a feature space, aligns the data from a center and a distribution layer, and respectively aligns the features of different categories in consideration of the distribution difference of the features of different categories, thereby realizing domain-invariant feature transformation with high discrimination and enhancing the multi-domain generalization performance of the model.

Disclosure of Invention

(one) solving the technical problems

Aiming at the defects of the prior art, the invention provides a semantic perception-based cross-domain road scene semantic segmentation method, which has the advantages of realizing high discrimination and invariable feature conversion by training a model only through source domain data without depending on any target domain data and data enhancement method, ensuring the model to have considerable multi-domain generalization performance and the like, and solving the problems in the background technology.

(II) technical scheme

The invention provides the following technical scheme: a semantic-aware-based cross-domain road scene semantic segmentation method, comprising the steps of:

s1, sending the source domain images into a feature encoder according to batches to obtain features of a plurality of images in the batches.

S2, sending the image features in the batch to a central alignment module based on semantic perception, and realizing global central alignment of source domain images with different styles in a feature space.

And S3, sending the characteristics with the aligned centers into a distribution alignment module based on semantic perception, and further realizing local distribution alignment on the image characteristics of various styles in the batch.

And S4, sending the processed features to a feature decoder to obtain a semantic segmentation prediction result consistent with the original image in size.

S5, calculating a loss value of the segmentation prediction result of the current batch of images and training the network according to the loss value.

And S6, saving the training model to obtain the model which is applied to any scene for semantic segmentation.

Preferably, in the step S1, the number of images in the batch process is at least 2 images.

Preferably, in the steps S2 and S3, the segmentation classes are ordered according to the proportion occupied in the data set, and the first 16 classes are selected as semantic perception objects.

Preferably, in the step S4, the decoded image is consistent with the original image, where the original image refers to an original image after clipping and scaling, the clipping mode is clipping at the long side of the image with equal step length according to the width of the image, and then scaling uniformly to an image with a resolution of 640 x 640, and the network model is one of three of deep convolutional neural networks VGG-16, resnet-50 and Resnet-101.

Preferably, in the step S1, feature extraction is performed on the source domain batch image X by a feature encoder F (), so as to obtain an image feature F, which is shown in the following formula:

original image feature f=f (X)

Preferably, in the step S2, the source domain original features are first aligned in the center, and in order to improve the feature discrimination, the category features are first roughly separated:

category characteristics

Wherein M is _c Class c mask, generated by semantic segmentation classifier, F' _c The obtained class c features are coarsely extracted.

Preferably, the coarse features are refined to obtain accurate class features:

optimized class features

Wherein Sigm () is a Sigmoid activation function, compressing the feature values to [0,1 ]]Inside, f ^3×3 (.) a convolution of 3X 3, F' _c,max And F' _c,avg The characteristic value obtained after maximum value pooling and average value pooling for the category coarse characteristic,

is element-by-element dot product.

Preferably, each obtained fine category feature is independently normalized to realize semantic perception center alignment, and finally fused into a single feature map:

post center alignment feature

Where c=16 is the number of categories, IN (F _c ,M _c ) To M _c In-range for category characteristic F _c And carrying out instance normalization to realize semantic perception center alignment.

Compared with the prior art, the invention provides a semantic perception-based cross-domain road scene semantic segmentation method, which has the following beneficial effects:

1. according to the semantic segmentation method for the cross-domain road scene based on semantic perception, the labeling cost is reduced, the source domain image which is the most basic of the training network can be directly obtained from the game, the corresponding label is directly generated through the engine, manual labeling is not needed, and a large amount of manpower and material resources are saved.

2. The cross-domain road scene semantic segmentation method based on semantic perception is convenient to train, the model does not need to additionally capture a target domain image, and meanwhile, data enhancement is not required to be carried out on source domain data, and the influence of any image style is avoided, so that the method can be universally applied to various real application scenes.

3. The semantic-perception-based cross-domain road scene semantic segmentation method has high precision and good universality, and on the basis of not contacting any target domain data, four cross-domain semantic segmentation settings of GTA5 to Cityscapes, GTA to BDDS, GTA5 to Mapilary and GTA5 to SYNTHIA respectively reach 38.21, 36.30, 36.87 and 28.45 mIoU on a backbone network of VGG-16; on the backbone network of Resnet-50, mIoU of 39.75, 37.34, 41.86 and 30.79 are respectively reached on four cross-domain semantic segmentation settings; on the backbone network of Resnet-101, mIoU of 45.33, 41.18, 40.77 and 31.84 are achieved on four cross-domain semantic segmentation settings, respectively.

4. The cross-domain road scene semantic segmentation method based on semantic perception has good development prospect, data does not depend on any target domain data, does not depend on any data enhancement means, can be multiplexed with other feature normalization and data enhancement methods, and can meet the requirements of deeper networks and higher computational power conditions in the future.

Drawings

FIG. 1 is a schematic flow chart of the present invention;

FIG. 2 is a schematic diagram of a network architecture according to the present invention;

fig. 3 is a cross-domain semantic segmentation effect contrast diagram under four real scenes.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Referring to fig. 1-3, a semantic segmentation method for a cross-domain road scene based on semantic perception, the semantic segmentation method comprises the following steps:

In this embodiment, in step S1, no less than 2 images are required to participate in the network training as a composition batch, and feature extraction is performed on the source domain batch image X by a feature encoder F (), so as to obtain an image feature F, as shown in the following formula:

original image feature f=f (X)

In this embodiment, in steps S2 and S3, the segmentation classes are ordered according to the proportion occupied in the dataset, and the first 16 classes are selected as semantic perception objects. In step S2, the original features of the source domain are first aligned in the center, and in order to improve the feature discrimination, the category features are first roughly separated:

category characteristics

Wherein M is _c Class c mask, generated by semantic segmentation classifier, F' _c The obtained class c features are coarsely extracted. And then refining the coarse features to obtain accurate class features:

optimized class features

is element-by-element dot product. Finally, each obtained fine category feature is independently normalized to realize semantic perception center alignment, and finally, the fine category features are fused into a single feature map:

post center alignment feature

In this embodiment, step S3 further aligns the center-aligned features to distribution, and groups the features according to channels first to obtain an mth group of nth image features in the batch:

feature channel grouping

Then, the intra-group features are whitened by the example according to the channel grouping, and the specific formula is as follows:

semantic level distribution alignment loss

Wherein, ψ (), I is a unit matrix, and the decorrelation of the channels is realized by constraining the channel covariance matrix in the packet as a unit matrix, thereby achieving the aim of aligning semantic perception distribution.

The semantic perception-based cross-domain road scene semantic segmentation method realizes center alignment and distribution alignment in a feature space by batched image data, and simultaneously maintains the distance between class features to realize domain-invariant feature conversion with high discrimination; the method fully considers that the data of the target domain is difficult to acquire in advance in the application scene, and proposes that the network is trained only in the source domain, thereby realizing the reliable cross-domain segmentation effect and having strong model universality.

In the encoding process, conventional VGG-16, resnet-50 and Resnet-101 are used for feature extraction; the network structure is an end-to-end 'encoding-decoding' structure, in the decoding process, each module receives the output of the previous module as input, then nearest neighbor interpolation is carried out, so that the size of the feature map becomes 2 times of the input, and in the training process, the current segmentation effect of the network is measured by adopting a cross entropy loss function form, and the network weight is penalized.

Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. A semantic perception-based cross-domain road scene semantic segmentation method is characterized by comprising the following steps of: the semantic segmentation method comprises the following steps:

s1, sending source domain images into a feature encoder according to batches to obtain features of a plurality of images in the batches;

s2, sending the image features in the batch to a central alignment module based on semantic perception, and realizing global central alignment of source domain images with different styles in a feature space;

firstly, aligning the centers of original features of a source domain, and in order to improve the feature discrimination, roughly separating category features:

category characteristics

Wherein M is _c Class c mask, generated by semantic segmentation classifier, F _c ' is a class c feature obtained by rough extraction;

refining the coarse features to obtain accurate class features:

optimized class features

is element-by-element dot multiplication;

normalizing each obtained fine category feature independently to realize semantic perception center alignment, and finally fusing into a single feature map:

post center alignment feature

Where c=16 is the number of categories, IN (F _c ,M _c ) To M _c In-range for category characteristic F _c Performing instance normalization to realize semantic perception center alignment;

s3, sending the characteristics with the aligned centers into a distribution alignment module based on semantic perception, and further realizing local distribution alignment on the image characteristics of various styles in the batch;

the center-aligned features are further distributed and aligned, and the features are firstly grouped according to channels to obtain an mth group of nth image features in the batch:

feature channel grouping

semantic level distribution alignment loss

Wherein, ψ () is a channel covariance matrix, I is a unit matrix, and the decorrelation of the channels is realized by restricting the channel covariance matrix in the group as the unit matrix, thereby achieving the aim of aligning semantic perception distribution;

s4, sending the processed features to a feature decoder to obtain a semantic segmentation prediction result consistent with the original image in size;

s5, calculating a loss value according to the segmentation prediction result of the current batch of images and training a network according to the loss value;

2. The semantic perception-based cross-domain road scene semantic segmentation method according to claim 1, wherein the method is characterized by comprising the following steps of: in the step S1, the number of images in batch processing is at least 2 images.

3. The semantic perception-based cross-domain road scene semantic segmentation method according to claim 1, wherein the method is characterized by comprising the following steps of: in the steps S2 and S3, the segmentation classes are ordered according to the proportion occupied in the data set, and the first 16 classes are selected as semantic perception objects.

4. The semantic perception-based cross-domain road scene semantic segmentation method according to claim 1, wherein the method is characterized by comprising the following steps of: in the step S4, the decoded image and the original image are kept consistent in size, where the original image refers to an original image after clipping and scaling, the clipping mode is clipping at the long side of the image with equal step length according to the width of the image, and then scaling is performed uniformly to an image with 640 x 640 resolution, and the network model is one of three kinds of deep convolutional neural networks VGG-16, resnet-50 and Resnet-101.

5. The semantic perception-based cross-domain road scene semantic segmentation method according to claim 1, wherein the method is characterized by comprising the following steps of: in the step S1, feature extraction is performed on the source domain batch image X by a feature encoder F (), so as to obtain an image feature F, which is shown in the following formula:

original image feature f=f (X).