CN116993975A - Panoramic camera semantic segmentation method based on deep learning unsupervised domain adaptation - Google Patents

Panoramic camera semantic segmentation method based on deep learning unsupervised domain adaptation

Info

Publication number
CN116993975A
Authority
CN
China
Prior art keywords
network
image
training
semantic segmentation
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310844676.8A
Other languages
Chinese (zh)
Inventor
王军华
袁铮
王鼎
胡凯
叶璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN202310844676.8A priority Critical patent/CN116993975A/en
Publication of CN116993975A publication Critical patent/CN116993975A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/94 Hardware or software architectures specially adapted for image or video understanding
    • G06V10/955 Hardware or software architectures specially adapted for image or video understanding using specific electronic processors

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of semantic segmentation, and specifically relates to a panoramic camera semantic segmentation method based on deep learning unsupervised domain adaptation. The method comprises the following steps: constructing a vision-transformer encoder and decoder that encode and decode an input image through vision-transformer, deformable-convolution, multi-layer-perceptron, and convolution operations; learning severely deformed objects in the panoramic image with a deformable multi-level feature fusion module; using an online prototype adaptation module, combined with the self-training method of unsupervised domain adaptation, to improve the network's ability to extract target-domain feature maps through online, unidirectional, class-by-class alignment of the fused feature maps of pinhole and panoramic images; and adopting a rapid evaluation method when the network model is applied, which improves both the running speed and the segmentation accuracy of the network. The method addresses the high training requirements, low model accuracy, and low running speed that hamper the application of conventional semantic segmentation methods.

Description

Panoramic camera semantic segmentation method based on deep learning unsupervised domain adaptation
Technical Field
The invention belongs to the technical field of semantic segmentation, and specifically relates to a panoramic camera semantic segmentation method based on deep learning unsupervised domain adaptation.
Background
Semantic segmentation algorithms are mostly applied to images shot by pinhole cameras and are rarely applied directly to images shot by panoramic cameras. Algorithms designed for pinhole images are generally complex, demanding to train, and low in segmentation accuracy when applied to panoramic images. On one hand, labeled panoramic image datasets are scarce; many algorithms cope with the lack of labeled training data by unsupervised domain adaptation and similar techniques, and although they perform well on pinhole images, they perform poorly on panoramic images and are difficult to use directly. On the other hand, owing to the imaging geometry, objects in a panoramic image suffer severe distortion; many pinhole-image semantic segmentation algorithms either ignore this problem or fail to propose an efficient strategy for it. When such algorithms are applied to panoramic images, the lack of a labeled training set and the deformation of objects are usually addressed with unsupervised domain adaptation methods and dedicated model designs, but the effect is often insufficient to support their application to panoramic images.
In recent years, research on semantic segmentation of panoramic images has grown year by year. Most such algorithms are variants of mainstream pinhole-image semantic segmentation algorithms. Although they address the problems above in a targeted way, most require long training time and high-quality labeled data, the training process is complex and tedious, the segmentation accuracy of the models is not high enough, and the algorithm frameworks are difficult to deploy on panoramic cameras in practice. The network models used by these algorithms struggle to satisfy the three requirements of easy training, high segmentation accuracy, and high running speed at the same time, and therefore cannot provide a suitable foundation for practical panoramic-image semantic segmentation.
Disclosure of Invention
In view of this situation, the invention aims to provide a panoramic camera semantic segmentation method based on deep learning unsupervised domain adaptation.
The panoramic camera semantic segmentation method based on deep learning unsupervised domain adaptation provided by the invention comprises: constructing a decoder with a deformable multi-level feature fusion module, constructing an online prototype adaptation module, and adopting a rapid evaluation method when the network model is applied to panoramic images.
the network model structure of the invention is a visual transformation network codec structure, which comprises a visual transformation network encoder and a decoder composed of a deformable multistage feature fusion module and a classifier. The visual transformation network encoder adopts a mixed transformation network (Mix transformation) 1 encoder, and comprises an overlapped patch embedding module and four transformation network modules, which are used for extracting multi-scale characteristics of an image.
The deformable multi-level feature fusion module in the decoder comprises four groups of deformable convolution units, four groups of multi-layer perceptron units, and one convolution unit, which further extract features, align channels, and fuse the per-scale feature maps along the channel dimension. In the decoder, the multi-scale feature maps are first further processed by the four groups of deformable convolution units and four groups of multi-layer perceptron units to align the feature-map channels; the maps are then uniformly resized to 1/4 of the input image size by bilinear interpolation, and the stacked per-scale feature maps are fused along the channel dimension by the convolution unit to obtain a fused feature map; finally, the fused feature map is processed by the classifier in the decoder to obtain a pixel-wise segmentation probability map, and the final semantic segmentation labels are obtained through bilinear interpolation, SoftMax, and max operations.
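A minimal PyTorch sketch of this final label-extraction step (the function name and tensor layout are illustrative assumptions, not the patent's notation):

```python
import torch
import torch.nn.functional as F

def logits_to_labels(logits: torch.Tensor, out_hw: tuple) -> torch.Tensor:
    """Bilinear upsampling -> SoftMax -> max, as described above.

    `logits` is the B x num_classes x H/4 x W/4 map from the classifier;
    `out_hw` is the (H, W) of the input image. Returns B x H x W label maps.
    """
    logits = F.interpolate(logits, size=out_hw, mode="bilinear",
                           align_corners=False)
    probs = logits.softmax(dim=1)   # per-pixel class probabilities
    return probs.argmax(dim=1)      # class index with maximum probability
```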
Further, in the deformable multi-level feature fusion module, each deformable convolution unit comprises three convolution operations: the first two predict the offsets and the mask for the kernel of the third, and the third performs the convolution using those offsets and that mask. This handles feature extraction for deformed objects in the panoramic image while keeping the parameter count low. The module receives four feature maps of different scales from the encoder, at 1/4, 1/8, 1/16, and 1/32 of the input size, with four corresponding groups of input channel counts. After processing by the respective combinations of deformable convolution units and multi-layer perceptron units, the four levels of feature maps are concatenated along the channel dimension and reduced to the designated channel count by one convolution unit, so that the features of deformed objects in the panoramic image are fused better.
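A sketch of one such unit using torchvision's modulated deformable convolution; the bias-free design and the offset/mask channel counts (18 and 9 for a 3x3 kernel) follow the description below, while the class name and weight initialization are assumptions:

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformableConvUnit(nn.Module):
    """Two plain convolutions predict the offsets and the modulation mask
    of a third, deformable 3x3 convolution, followed by GELU activation."""

    def __init__(self, channels: int):
        super().__init__()
        self.offset_conv = nn.Conv2d(channels, 18, 3, stride=1, padding=1)  # 2*3*3
        self.mask_conv = nn.Conv2d(channels, 9, 3, stride=1, padding=1)     # 3*3
        self.weight = nn.Parameter(torch.empty(channels, channels, 3, 3))
        nn.init.kaiming_uniform_(self.weight)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offset = self.offset_conv(x)
        mask = torch.sigmoid(self.mask_conv(x))   # modulation values in [0, 1]
        out = deform_conv2d(x, offset, self.weight,
                            stride=1, padding=1, mask=mask)  # no bias
        return self.act(out)
```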
The online prototype adaptation module comprises a convolution unit trained jointly with the unsupervised domain adaptation; it maps the fused feature map to 256-channel embedded features. During unsupervised domain adaptation training, the labeled pinhole image dataset serves as the source domain and the unlabeled panoramic image dataset as the target domain. The module screens the classes that occur in both domains using the real labels of the source domain and the pseudo labels generated during the unsupervised training of the target domain; it computes the vector center of each common class in the source-domain embedded features, i.e., the class prototype; and it computes a knowledge-distillation loss between the target-domain embedded features and the source-domain class prototypes. This shortens the distribution distance between the common-class feature vectors of the source and target domains in latent space, so that the model can learn the feature vectors of a class in the target domain by referring to the feature vectors of the same class learned from the source domain under supervision.
The rapid evaluation method when applying the network model is as follows: the large panoramic image is first split horizontally into four parts, which are stacked into one batch and processed by the network model; the per-part segmentation label maps obtained are then stitched back together to yield the semantic segmentation label map aligned with the original panorama. This greatly improves both the processing speed and the semantic segmentation accuracy of the model.
The panoramic camera semantic segmentation method based on deep learning unsupervised domain adaptation provided by the invention comprises the following specific steps:
Step 1: shoot images with a panoramic camera in indoor and outdoor environments, adjust camera parameters such as exposure time, focal length, aperture, and smart brightness according to the environment and usage requirements, and acquire and store raw images while ensuring image quality.
Step 2: screen the collected panoramic images, unfold them by transforming polar coordinates to Cartesian coordinates, and crop the blind areas to obtain complete panoramic images; build an unlabeled training image dataset of no fewer than 1000 images from the images thus collected and processed; according to scene requirements, additionally select and annotate 50 to 100 images to build a test dataset.
Step 3: select a labeled pinhole-image semantic segmentation dataset such as Cityscapes or ADE20K, preprocess it, and select the categories to be distinguished when semantically segmenting the panoramic image, such as person, bicycle, and building.
Step 4: load pre-training weights into the vision-transformer-based encoder and decoder neural networks and initialize the parameters of each network model required by the unsupervised domain adaptation training framework, including a student network and a teacher network, where the student network is the semantic segmentation network to be obtained; use the pinhole image dataset and the acquired panoramic image dataset as source domain and target domain respectively, select images by batch, feed them into the student network for training, optimize the loss function, and carry out unsupervised domain adaptation training.
Step 5: semantic segmentation testing: invoke the trained semantic segmentation network model, process the unfolded and cropped panoramic image with the rapid evaluation method, input it into the model, and output the semantic segmentation image labels.
Further, in the unsupervised domain adaptation training process, each loss function value is computed from the forward propagation of the student network, and the network parameters are updated by back-propagation; the parameters of the teacher network are updated from the parameters of the student network after each training iteration.
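A sketch of this teacher update as an exponential moving average; the patent does not state the decay value, so 0.999 below is an assumption (a common choice in self-training frameworks such as DAFormer [2]):

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module,
               alpha: float = 0.999) -> None:
    """After each training iteration, move every teacher parameter toward
    the corresponding student parameter: t = alpha * t + (1 - alpha) * s."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)
```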
During training, the sizes (height x width) of the pinhole image and the panoramic image are 1024x2048 and 512x2048 respectively. The panoramic image should contain as complete a horizontal field of view as possible. The data-enhancement settings for training can be adjusted for other image sizes.
Conventional methods and existing research focus on improving model segmentation accuracy on panoramic images by means of complex network structures and training methods designed for pinhole images, but semantic segmentation of panoramic images then requires large training-time and hardware costs, and its application faces difficulties such as low model running speed and insufficient accuracy. With the invention, only some panoramic images need to be collected by the user to build a simple dataset; the features learned with the unsupervised domain adaptation method suit the images shot by the given panoramic camera, the distortion of objects in the images can be handled effectively, and the requirements of panoramic-camera semantic segmentation in real application scenarios are met, namely that the training process should be simple and the model should be sufficiently accurate. In addition, when the model is applied, the long panoramic image is split horizontally into small images that are processed as one batch, so that GPU performance is used more reasonably and image processing speed is improved.
The invention has the positive progress effects that:
according to the invention, a semantic segmentation neural network based on a visual transformation network is constructed, a deformable convolution unit is added for further extracting the characteristics of a distorted object, an online prototype adaptation module is added, a self-training method is used for training a semantic segmentation network model, a rapid evaluation method is used for processing a panoramic image by applying the trained semantic segmentation network, and a semantic segmentation label is obtained through output. According to the requirements of different indoor and outdoor application scenes of the deep learning application of the panoramic camera, the semantic segmentation is carried out on the images of the panoramic camera under different emphasis categories of different indoor and outdoor scenes, the difficulties that the training requirement of the conventional training segmentation model is high, the model precision is low, the model running speed is low and the like are solved, and the method can be applied to other visual tasks of the panoramic camera.
Drawings
Fig. 1 is a flowchart of the panorama camera semantic segmentation method based on unsupervised domain adaptation of the present invention.
Fig. 2 is a training flow diagram of the self-training method based on unsupervised domain adaptation of the present invention.
Fig. 3 is a block diagram of the overall structure of the vision-transformer-based encoder-decoder neural network of the present invention.
Fig. 4 is a schematic diagram of a deformable multi-level feature fusion module of the present invention.
Fig. 5 is a schematic diagram of the online prototype adaptation module of the present invention.
Fig. 6 is a schematic diagram of the rapid assessment method of the present invention.
Detailed Description
The invention is further described below by way of examples with reference to the accompanying drawings.
As shown in fig. 1, the panoramic camera semantic segmentation method based on deep learning unsupervised domain adaptation of the invention comprises the following steps:
Step 1: shoot images with a panoramic camera in indoor and outdoor environments, adjust camera parameters such as exposure time, focal length, aperture, and smart brightness according to the environment and usage requirements, and acquire and store raw images while ensuring image quality.
Step 2: screen the collected panoramic images, unfold them by transforming polar coordinates to Cartesian coordinates, and crop the blind areas to obtain complete panoramic images; build an unlabeled training image dataset of no fewer than 1000 images from the images thus collected and processed; according to scene requirements, additionally select and annotate 50 to 100 images to build a test dataset.
Step 3: select a labeled pinhole-image semantic segmentation dataset such as Cityscapes or ADE20K, preprocess it, and select the categories to be distinguished when semantically segmenting the panoramic image, such as person, bicycle, and building.
Step 4: construct the vision-transformer-based encoder and decoder neural networks, load pre-training weights, and initialize the parameters of each network model required by the unsupervised domain adaptation training framework, including a student network and a teacher network, where the student network is the semantic segmentation network to be obtained; use the pinhole image dataset and the acquired panoramic image dataset as source domain and target domain respectively, select images by batch, feed them into the student network for training, optimize the loss function, and carry out unsupervised domain adaptation training.
Step 5: semantic segmentation testing: invoke the trained semantic segmentation network model, process the unfolded and cropped panoramic image with the rapid evaluation method, input it into the model, and output the semantic segmentation image labels.
The original panoramic images are collected and pre-processed, including unfolding and cropping of blind areas. The labeled pinhole image dataset is used as the source domain and the unlabeled panoramic image dataset as the target domain. Data enhancement applied to source-domain images includes: random Gaussian blur with probability 0.2; random color jitter with brightness, contrast, and saturation of 0.25, with probability 0.2; random vertical and horizontal flips with probability 0.2; resizing to 512x1024 (height x width), random scaling with a factor of 1.2 to 2.0 applied with probability 1.0, and random cropping of a 512x512 patch. Data enhancement applied to target-domain images: random vertical and horizontal flips with probability 0.2; resizing to 768x3072 (height x width), random scaling with a factor of 1.0 to 1.5 applied with probability 1.0, and random cropping of a 512x512 patch. Because the panoramic and pinhole images are similar in size while their fields of view differ by a factor of 3, the resized panoramic image is 1.5 times longer and wider than the pinhole image. The final images are normalized with the mean and variance of ImageNet.
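A sketch of the source-domain pipeline under these settings, using torchvision transforms. The Gaussian-blur kernel size is an assumption (the patent does not state one), and the joint geometric transform of the segmentation labels is omitted for brevity; the target-domain pipeline is analogous with resize 768x3072 and scale 1.0 to 1.5:

```python
import random
import torchvision.transforms as T
import torchvision.transforms.functional as TF

IMAGENET_MEAN, IMAGENET_STD = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]

class RandomScale:
    """Rescale a PIL image by a random factor drawn from [lo, hi]."""
    def __init__(self, lo: float, hi: float):
        self.lo, self.hi = lo, hi
    def __call__(self, img):
        s = random.uniform(self.lo, self.hi)
        w, h = img.size
        return TF.resize(img, [int(h * s), int(w * s)])

source_aug = T.Compose([
    T.RandomApply([T.GaussianBlur(kernel_size=5)], p=0.2),   # kernel size assumed
    T.RandomApply([T.ColorJitter(brightness=0.25, contrast=0.25,
                                 saturation=0.25)], p=0.2),
    T.RandomVerticalFlip(p=0.2),
    T.RandomHorizontalFlip(p=0.2),
    T.Resize((512, 1024)),        # (height, width)
    RandomScale(1.2, 2.0),        # applied with probability 1.0
    T.RandomCrop(512),
    T.ToTensor(),
    T.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```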
As shown in FIG. 2, the invention follows the self-training method for unsupervised domain adaptation used by DAFormer [2]; the strategies of that framework, including Rare Class Sampling (RCS), the feature-distance loss, and class-mix enhancement (ClassMix), are all retained. Supervised training is performed on the source-domain pinhole images, and unsupervised training is performed on the target-domain panoramic images by generating pseudo labels online. Self-training involves two models with the same structure: the model to be trained is the student model and the other is the teacher model. Before training, the encoder of the student model loads the Mix Transformer encoder pre-training weights; the teacher model's weights are initialized from the student model and subsequently updated from the student model by exponential moving average (EMA). During training, source-domain image samples and labels are obtained through rare class sampling, and target-domain image samples are obtained randomly. The student model first processes a source-domain sample to obtain the segmentation output and the source-domain embedded features, computes the semantic segmentation loss and the feature-distance loss, and back-propagates to update the student network parameters. The teacher model processes a target-domain sample to obtain a high-confidence pseudo label for it, and class-mix enhancement is applied to the source-domain sample, its label, the target-domain sample, and its pseudo label to obtain a mixed sample and a mixed-sample pseudo label. The student model processes the mixed sample to obtain the mixed-sample segmentation output and mixed-sample embedded features, computes the mixed-sample semantic segmentation loss, computes the online prototype adaptation loss from the source-domain embedded features, mixed-sample embedded features, source-domain labels, and mixed-sample pseudo labels, and finally computes gradients to update the student model parameters. The semantic segmentation loss is the OhemCE (online hard example mining cross-entropy) loss; its underlying cross-entropy is given by formula (1):
$$L_{ce}(x, y) = -\frac{1}{H \cdot W} \sum_{i=1}^{H \times W} \sum_{c=1}^{C} y_{i,c} \, \log S_\theta(x)_{i,c} \quad (1)$$

wherein y is the label, S_θ is the student network, x is the input image, C is the number of classes, and H and W are the height and width of the image. The OhemCE loss first screens the pixel vectors whose maximum probability reaches the 0.7 threshold and computes the loss value with formula (1); if the number of pixel vectors reaching the 0.7 threshold exceeds 3000, the pixel vectors with the highest probability in the output probability map are selected to compute the loss.
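A sketch of the OhemCE loss under the standard hard-example-mining convention, in which hard pixels are those whose ground-truth-class probability falls *below* the threshold and at least 3000 pixels are kept. The translated prose above may read with the opposite polarity, so this is one common reading rather than the patent's exact rule:

```python
import torch
import torch.nn.functional as F

def ohem_ce_loss(logits: torch.Tensor, target: torch.Tensor,
                 thresh: float = 0.7, min_kept: int = 3000,
                 ignore_index: int = 255) -> torch.Tensor:
    """logits: B x C x H x W, target: B x H x W integer labels."""
    # per-pixel cross-entropy from formula (1), no reduction yet
    loss = F.cross_entropy(logits, target, ignore_index=ignore_index,
                           reduction="none").flatten()
    with torch.no_grad():
        valid = target.flatten() != ignore_index
        idx = target.clamp(min=0, max=logits.size(1) - 1)
        gt_prob = logits.softmax(dim=1).gather(1, idx.unsqueeze(1)).flatten()
        hard = (gt_prob < thresh) & valid          # low-confidence pixels
        if hard.sum() < min_kept:
            # fall back to the min_kept lowest-confidence valid pixels
            scores = gt_prob.masked_fill(~valid, float("inf"))
            k = min(min_kept, int(valid.sum()))
            hard = torch.zeros_like(valid)
            hard[scores.topk(k, largest=False).indices] = True
    return loss[hard].mean()
```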
As shown in fig. 3, the invention is based on the Mix Transformer structure, and one of its 6 encoder sizes can be selected as the model encoder depending on hardware conditions. The overall network structure is as follows:
the size of the input image is 512x512 in training, the size of the image can be set arbitrarily in application, and the network structure comprises an encoder and a decoder. Each transformation network module in the encoder sequentially processes images, performs forward propagation, extracts image feature vectors with the sizes of 1/4, 1/8, 1/16 and 1/32 respectively, and adjusts the sizes of the vectors to obtain each level of feature images. The decoder comprises four groups of deformable convolution units, four groups of multi-layer perceptron units and two convolution units:
in the deformable convolution units, all convolution kernels are 3x3 in size, the step sizes are 1x1, the filling is 1, no offset exists, the number of input channels of the four groups of deformable convolution units is 64, 128, 320 and 512, the number of deformable convolution output channels is the same as the number of input channels, the number of offset convolution output channels is 18, the number of mask convolution output channels is 9, and the result output by each deformable convolution operation is subjected to Gaussian error linear activation unit function (GELU).
The input channel counts of the four groups of multi-layer perceptron units are 64, 128, 320, and 512, the output channel count is 256, and the linear-layer weight matrix has size Nx256, where N is the input channel count. The convolution unit in the deformable multi-level feature fusion module is a 1x1 convolution with stride 1, no padding, and no bias, followed by a BatchNorm operation and a ReLU activation; its input channel count is 1024 and its output channel count is 256.
The classifier of the decoder is a convolution unit comprising a 1x1 convolution with stride 1, no padding, and no bias, followed by a BatchNorm operation and a ReLU activation; its input channel count is 256 and its output channel count equals the number of classes.
The online prototype adaptation module comprises a convolution unit with a 1x1 convolution, stride 1, no padding, and no bias, followed by a BatchNorm operation and a ReLU activation; its input and output channel counts are both 256.
As shown in fig. 4, the calculation method of the deformable multistage feature fusion module of the present invention includes:
the method comprises the steps of a deformable convolution unit, a multi-layer perceptron unit and a convolution unit, wherein all levels of feature images are sequentially subjected to offset convolution processing and mask convolution processing in the deformable convolution unit to obtain offset and mask of convolution kernel values required by the deformable convolution, all levels of feature images are subjected to deformable convolution operation to obtain feature images M with the same number as that of original channels, the size of the feature images M is BxCxHxW, the size of the feature images M is BxCx (HxW) after the feature images M are flattened through flattening in the last two dimensions H, W, the size of the feature images M is Bx (HxW) xC after the flattening, the feature images are subjected to transposition transformation in the last two dimensions, then the feature images are subjected to multi-layer perceptron unit processing to obtain feature images with the size of Bx (HxW) x256, the feature images S are obtained through matrix transformation, the feature images S are subjected to bilinear interpolation adjustment, and the size of the feature images S is training samples 1/4. The four-level feature images from the encoder are subjected to similar operation processing to obtain feature images with the same size and containing features with different scales, the images are overlapped according to the channel direction, the number of channels is changed from 1024 to 256 after being overlapped by convolution operation, and the fusion feature image fusing the feature images of the encoder at all levels is obtained. The decoder has 12 convolution operations and four full-connection layer operations with smaller sizes, so that the parameter quantity of the model can be reduced, and the running speed of the model can be improved. The deformable convolution unit can adjust the offset and the mask of the deformable operation convolution kernel aiming at the object with serious deformation, and better image characteristics are extracted;
as shown in fig. 5, the calculation method of the online prototype adaptation module of the present invention includes:
and a convolution unit, which is used for inputting a fusion feature map into the convolution unit to obtain an embedded feature, wherein the feature map is the final code of the image depth feature, the size of the feature map is 1/4 of the sample size, a source domain sample label and a mixed sample pseudo label t are adjusted to the embedded feature size through nearest interpolation, commonly-occurring categories are screened, the average value P of prototype corresponding pixels of each common category, namely similar pixel feature vectors, is calculated in the source domain embedded feature, feature vectors of pixels of the same category are selected in the mixed sample embedded feature E according to the mixed sample pseudo label t, the vectors are replaced by P, and the embedded feature obtained after the replacement is P. The online prototype adaptive loss calculation method is as follows in formula (2):
wherein λ and τ are hyperparameters set to 20 and 0.9 respectively, CE and SoftMax denote the cross-entropy function and the SoftMax function, and KL denotes the KL-divergence function.
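Since formula (2) itself is not reproduced here, the following is only a plausible reading of the loss, consistent with the named hyperparameters (λ = 20, τ = 0.9) and the SoftMax/KL functions: class prototypes replace same-class pixels of the mixed-sample embeddings, and a temperature-scaled KL term distills them into the student's embeddings. All function and argument names are assumptions:

```python
import torch
import torch.nn.functional as F

def online_prototype_adaptation_loss(
        embed_src: torch.Tensor,   # B x 256 x h x w source embeddings
        embed_mix: torch.Tensor,   # B x 256 x h x w mixed-sample embeddings E
        label_src: torch.Tensor,   # B x h x w source labels (nearest-resized)
        pseudo_mix: torch.Tensor,  # B x h x w mixed-sample pseudo labels t
        lam: float = 20.0, tau: float = 0.9,
        ignore_index: int = 255) -> torch.Tensor:
    c = embed_src.size(1)
    src = embed_src.permute(0, 2, 3, 1).reshape(-1, c)
    mix = embed_mix.permute(0, 2, 3, 1).reshape(-1, c)
    ls, lm = label_src.flatten(), pseudo_mix.flatten()
    # prototype feature map P: start from E, replace common-class pixels
    proto = mix.clone()
    common = (set(ls.unique().tolist()) & set(lm.unique().tolist())) - {ignore_index}
    for cls in common:
        p = src[ls == cls].mean(dim=0)   # class prototype (vector center)
        proto[lm == cls] = p
    # temperature-scaled distillation from P into E
    return lam * F.kl_div(F.log_softmax(mix / tau, dim=1),
                          F.softmax(proto.detach() / tau, dim=1),
                          reduction="batchmean")
```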
As shown in fig. 6, a method for performing rapid evaluation when applying a network model according to the present invention includes:
when the model is applied, the panoramic image is divided into four parts in the horizontal direction and spliced into a small batch along the batch direction, then the batch of images are input into the model for processing, and the processed result is spliced in the horizontal direction to obtain the result corresponding to the size of the original image.
During training, images are fed into the network model by batch, the loss functions are computed, and the network parameters are updated by back-propagation. The batch size is set to 4; about 3000 source-domain samples and 2000 target-domain samples are used; and training runs for 25000 iterations with the AdamW optimizer, with a learning rate of 0.0000375 for the model encoder and the online prototype adaptation module, 0.000375 for the decoder, β1 = 0.9, β2 = 0.999, weight_decay = 0.01, and a linearly decreasing learning-rate schedule. Using the Cityscapes dataset as the labeled source-domain dataset and the panoramic training dataset WildPASS2K as the unlabeled target-domain dataset, the trained models were tested for mIoU accuracy on the panoramic test dataset DensePASS, with the following results: with the Mix Transformer b2-level encoder (MiT-b2), the result is 55.60%, which is 0.85% below the currently best panoramic-image semantic segmentation algorithm (Trans4PASS+ (S)); however, the present method has a simpler training process and runs faster than the best method, which requires 200 epochs of supervised training in advance, whereas the present method needs only 25000 iterations and reaches a speed of 30 frames per second. With the Mix Transformer b5-level encoder (MiT-b5), the result is 58.27%, which is 1.82% higher than the best method.
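A sketch of this optimizer setup; the attribute names (encoder, decoder, prototype_head) are assumptions about how the model is organized:

```python
import torch

def build_optimizer(model, iters: int = 25000):
    """AdamW with a 10x larger learning rate for the decoder, betas
    (0.9, 0.999), weight decay 0.01, and a linear decay over `iters`."""
    groups = [
        {"params": model.encoder.parameters(), "lr": 3.75e-5},
        {"params": model.prototype_head.parameters(), "lr": 3.75e-5},
        {"params": model.decoder.parameters(), "lr": 3.75e-4},
    ]
    opt = torch.optim.AdamW(groups, betas=(0.9, 0.999), weight_decay=0.01)
    sched = torch.optim.lr_scheduler.LinearLR(
        opt, start_factor=1.0, end_factor=0.0, total_iters=iters)
    return opt, sched
```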
In the semantic segmentation test, the preprocessed panoramic images are input by batch into the trained semantic segmentation network model; the final semantic segmentation labels are computed in parallel on the GPU by the model's forward propagation and output.
The invention constructs an unlabeled training dataset from a panoramic camera and uses a semantic segmentation network with a vision-transformer-based encoder-decoder structure. According to application-scenario requirements, the invention performs semantic segmentation on panoramic images shot indoors and outdoors in different environments, and solves the high training requirements, low model accuracy, and low running speed of conventional panoramic-image semantic segmentation algorithms in application.
The invention can be combined with other computer-vision methods for panoramic images according to actual needs, such as instance segmentation and visual SLAM, to further improve the accuracy of higher-level computer-vision recognition algorithms. The above embodiments merely illustrate the present invention and do not limit its scope; those skilled in the art can make various modifications and variations without departing from the spirit of the invention, and such modifications and variations fall within the scope defined by the appended claims.
References:
[1] Xie E, Wang W, Yu Z, et al. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. 2021.
[2] Hoyer L, Dai D, Van Gool L. DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation. 2021.

Claims (6)

1. A method for semantic segmentation of a panoramic camera based on deep learning unsupervised domain adaptation, comprising: constructing a decoder with a deformable multi-level feature fusion module, constructing an online prototype adaptation module, and adopting a rapid evaluation method when the network model is applied to panoramic images;
the network model is a vision-transformer encoder-decoder structure, comprising a vision-transformer encoder and a decoder composed of a deformable multi-level feature fusion module and a classifier; the vision-transformer encoder adopts a Mix Transformer encoder, which comprises an overlapped patch embedding module and four transformer modules and is used to extract multi-scale features of the image;
the deformable multi-level feature fusion module comprises four groups of deformable convolution units, four groups of multi-layer perceptron units, and one convolution unit, which respectively further extract features, align channels, and fuse the per-scale feature maps along the channel dimension; in the decoder, the multi-scale feature maps are first further processed by the four groups of deformable convolution units and four groups of multi-layer perceptron units to align the feature-map channels; the maps are then uniformly resized to 1/4 of the input image size by bilinear interpolation, and the stacked per-scale feature maps are fused along the channel dimension by the convolution unit to obtain a fused feature map; finally, the fused feature map is processed by the classifier in the decoder to obtain a pixel-wise segmentation probability map; the final semantic segmentation labels are obtained through bilinear interpolation, SoftMax, and max operations;
the online prototype adaptation module comprises a convolution unit trained jointly with the unsupervised domain adaptation, which maps the fused feature map to 256-channel embedded features; during unsupervised domain adaptation training, a labeled pinhole image dataset is used as the source domain and an unlabeled panoramic image dataset as the target domain; the online prototype adaptation module screens the classes occurring in both domains using the real labels of the source domain and the pseudo labels generated during the unsupervised training of the target domain; computes the vector center of each common class in the source-domain embedded features, i.e., the class prototype; and computes a knowledge-distillation loss between the target-domain embedded features and the source-domain class prototypes; the distribution distance between the common-class feature vectors of the source and target domains is thereby shortened in latent space, so that the model can learn the feature vectors of a class in the target domain by referring to the feature vectors of the same class learned from the source domain under supervised training;
the rapid evaluation method when applying the network model is characterized in that the large panoramic image is split horizontally into four parts in advance, the parts are stacked into one batch and processed by the network model, and the per-part segmentation label maps obtained after processing are stitched back together to yield the semantic segmentation label map aligned with the original panoramic image.
2. The method of claim 1, wherein in the deformable multi-level feature fusion module each deformable convolution unit comprises three convolution operations, the first two predicting the offsets and the mask for the kernel of the third, and the third performing the convolution using those offsets and that mask, so that feature extraction for deformed objects in the panoramic image is handled while keeping the parameter count low; the deformable multi-level feature fusion module receives four feature maps of different scales from the encoder, at 1/4, 1/8, 1/16, and 1/32 of the input size, with four corresponding groups of input channel counts; after processing by the respective combinations of deformable convolution units and multi-layer perceptron units, the four levels of feature maps are concatenated along the channel dimension and reduced to the designated channel count by one convolution unit, so that the features of deformed objects in the panoramic image are fused better.
3. The method for semantic segmentation of a panoramic camera according to claim 2, comprising the specific steps of:
Step 1: shooting images with a panoramic camera in indoor and outdoor environments, adjusting the exposure time, focal length, aperture, and smart-brightness parameters of the camera according to the environment and usage requirements, and acquiring and storing raw images while ensuring image quality;
Step 2: screening the collected panoramic images, unfolding them by transforming polar coordinates to Cartesian coordinates, and cropping the blind areas to obtain complete panoramic images; building an unlabeled training image dataset of no fewer than 1000 images from the collected and processed images; according to scene requirements, additionally selecting and annotating 50 to 100 images to build a test dataset;
Step 3: selecting a labeled pinhole-image semantic segmentation dataset, preprocessing it, and selecting the categories to be distinguished when semantically segmenting the panoramic image;
Step 4: loading pre-training weights into the vision-transformer-based encoder and decoder neural networks, initializing the parameters of each network model required by the unsupervised domain adaptation training framework, including a student network and a teacher network, the student network being the semantic segmentation network to be obtained; using the pinhole image dataset and the acquired panoramic image dataset as source domain and target domain respectively, selecting images by batch and feeding them into the student network for training, optimizing the loss function, and carrying out unsupervised domain adaptation training;
Step 5: semantic segmentation testing: invoking the trained semantic segmentation network model, processing the unfolded and cropped panoramic image with the rapid evaluation method, inputting it into the model, and outputting the semantic segmentation image labels.
4. The method for semantic segmentation of a panoramic camera according to claim 3, wherein in the unsupervised domain adaptation training process each loss function value is computed from the forward propagation of the student network and the network parameters are updated by back-propagation; the parameters of the teacher network are updated from the parameters of the student network after each training iteration.
5. The method of claim 4, wherein the unsupervised domain adaptation training method is self-training; supervised training is performed on the source-domain pinhole images, and unsupervised training is performed on the target-domain panoramic images by generating pseudo labels online; self-training involves two models with the same structure, the model to be trained being the student model and the other the teacher model; before training, the encoder of the student model loads the Mix Transformer encoder pre-training weights, and the teacher model's weights are initialized from the student model and subsequently updated from the student model by exponential moving average; during training, source-domain image samples and labels are obtained through rare class sampling, and target-domain image samples are obtained randomly; the student model first processes a source-domain sample to obtain the segmentation output and the source-domain embedded features, computes the semantic segmentation loss and the feature-distance loss, and back-propagates to update the student network parameters; the teacher model processes a target-domain sample to obtain a high-confidence pseudo label for it, and class-mix enhancement is applied to the source-domain sample, its label, the target-domain sample, and its pseudo label to obtain a mixed sample and a mixed-sample pseudo label; the student model processes the mixed sample to obtain the mixed-sample segmentation output and mixed-sample embedded features, computes the mixed-sample semantic segmentation loss, computes the online prototype adaptation loss from the source-domain embedded features, mixed-sample embedded features, source-domain labels, and mixed-sample pseudo labels, and finally computes gradients to update the student model parameters; the semantic segmentation loss is the OhemCE loss, whose underlying cross-entropy loss is given by formula (1):
$$L_{ce}(x, y) = -\frac{1}{H \cdot W} \sum_{i=1}^{H \times W} \sum_{c=1}^{C} y_{i,c} \, \log S_\theta(x)_{i,c} \quad (1)$$

wherein y is the label, S_θ is the student network, x is the input image, C is the number of classes, and H and W are the height and width of the image; the OhemCE loss first screens the pixel vectors whose maximum probability reaches the 0.7 threshold and computes the loss value with formula (1); if the number of pixel vectors reaching the 0.7 threshold exceeds 3000, the pixel vectors with the highest probability in the output probability map are selected to compute the loss.
6. The method of claim 4, wherein the online prototype adaptation module is a convolution unit; a fused feature map is input to the convolution unit to obtain the embedded features, which are the final encoding of the image's deep features, at 1/4 of the sample size; the source-domain sample label and the mixed-sample pseudo label t are resized to the embedded-feature size by nearest-neighbor interpolation, and the classes occurring in both are screened; for each common class, the prototype, i.e., the mean P of the feature vectors of the pixels of that class, is computed in the source-domain embedded features; the feature vectors of pixels of the same class are selected in the mixed-sample embedded features E according to the mixed-sample pseudo label t and replaced by P, and the embedded features obtained after replacement constitute the prototype feature map P; the online prototype adaptation loss is computed as in formula (2):
wherein λ and τ are hyperparameters, CE and SoftMax denote the cross-entropy function and the SoftMax function respectively, and KL denotes the KL-divergence function.
CN202310844676.8A 2023-07-11 2023-07-11 Panoramic camera semantic segmentation method based on deep learning unsupervised domain adaptation Pending CN116993975A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310844676.8A CN116993975A (en) 2023-07-11 2023-07-11 Panoramic camera semantic segmentation method based on deep learning unsupervised domain adaptation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310844676.8A CN116993975A (en) 2023-07-11 2023-07-11 Panoramic camera semantic segmentation method based on deep learning unsupervised domain adaptation

Publications (1)

Publication Number Publication Date
CN116993975A true CN116993975A (en) 2023-11-03

Family

ID=88529275

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310844676.8A Panoramic camera semantic segmentation method based on deep learning unsupervised domain adaptation

Country Status (1)

Country Link
CN (1) CN116993975A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117830638A (en) * 2024-03-04 2024-04-05 厦门大学 Omnidirectional supervision semantic segmentation method based on prompt text


Similar Documents

Publication Publication Date Title
CN108986050B (en) Image and video enhancement method based on multi-branch convolutional neural network
WO2022252272A1 (en) Transfer learning-based method for improved vgg16 network pig identity recognition
CN109711413B (en) Image semantic segmentation method based on deep learning
CN111292264B (en) Image high dynamic range reconstruction method based on deep learning
CN110363716B (en) High-quality reconstruction method for generating confrontation network composite degraded image based on conditions
CN109886066B (en) Rapid target detection method based on multi-scale and multi-layer feature fusion
JP2019032821A (en) Data augmentation techniques using style transformation with neural network
CN110334589B (en) High-time-sequence 3D neural network action identification method based on hole convolution
CN113688723A (en) Infrared image pedestrian target detection method based on improved YOLOv5
CN104217404A (en) Video image sharpness processing method in fog and haze day and device thereof
CN111950453A (en) Optional-shape text recognition method based on selective attention mechanism
CN110046598B (en) Plug-and-play multi-scale space and channel attention remote sensing image target detection method
CN111680705B (en) MB-SSD method and MB-SSD feature extraction network suitable for target detection
CN112287941B (en) License plate recognition method based on automatic character region perception
CN110363770B (en) Training method and device for edge-guided infrared semantic segmentation model
CN111310609B (en) Video target detection method based on time sequence information and local feature similarity
CN116993975A (en) Panoramic camera semantic segmentation method based on deep learning unsupervised domain adaptation
CN111126155B (en) Pedestrian re-identification method for generating countermeasure network based on semantic constraint
CN115019173A (en) Garbage identification and classification method based on ResNet50
CN116863194A (en) Foot ulcer image classification method, system, equipment and medium
CN114581789A (en) Hyperspectral image classification method and system
CN114092467A (en) Scratch detection method and system based on lightweight convolutional neural network
CN113096133A (en) Method for constructing semantic segmentation network based on attention mechanism
CN115861595B (en) Multi-scale domain self-adaptive heterogeneous image matching method based on deep learning
CN116824330A (en) Small sample cross-domain target detection method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination