CN116310324A - Pyramid cross-layer fusion decoder based on semantic segmentation - Google Patents

Pyramid cross-layer fusion decoder based on semantic segmentation

Info

Publication number
CN116310324A
Authority
CN
China
Prior art keywords
pyramid
feature
cross
fusion
decoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310169764.2A
Other languages
Chinese (zh)
Inventor
张颂扬
任歌
张亮
林鸿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Institute Of Advanced Measurement Technology
Original Assignee
Zhengzhou Institute Of Advanced Measurement Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Institute Of Advanced Measurement Technology filed Critical Zhengzhou Institute Of Advanced Measurement Technology
Priority to CN202310169764.2A priority Critical patent/CN116310324A/en
Publication of CN116310324A publication Critical patent/CN116310324A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks


Abstract

The invention discloses an image semantic segmentation method based on a pyramid cross-layer fusion decoder. The decoder structure is optimized: an enhanced context embedding (RCE) generates rich context information from the feature pyramid, improving the representation capability of the model, and the cross-layer fusion idea of the ViT-Adapter encoder is extended to the decoder through a Fusion Block, realizing interactive fusion of the context information and the spatial information and improving the semantic segmentation result.

Description

Pyramid cross-layer fusion decoder based on semantic segmentation
Technical Field
The invention relates to the technical field of image processing, in particular to a pyramid cross-layer fusion decoder based on semantic segmentation.
Background
Semantic segmentation classifies every pixel of an input image according to the semantic categories of everyday objects and assigns each pixel the color of its predicted class, i.e., it colors the image. Because pixels of the same class receive the same color, the classes appear to be separated (segmented) from the input image, hence the name semantic segmentation. Generating the segmented image is the task of a semantic segmentation model, which has an encoder-decoder architecture: the encoder learns feature representations, and the decoder performs pixel-level classification on the feature representations produced by the encoder. Existing semantic segmentation models can be divided into two classes: CNN-based and Transformer-based semantic segmentation models.
Semantic segmentation models based on CNN: according to the kind of convolution used, CNN-based segmentation models can be divided into two classes: models based on dilated convolution and models based on ordinary convolution.
Models based on dilated convolution include the following. PSPNet [13] applies conventional convolutions on pyramid levels to capture multi-scale semantic information. The DeepLab series [3-6] adopts parallel dilated convolutions with different dilation rates, where different dilation rates capture context information at different scales. Recent work [17-20] proposes various dilation-based decoders: DenseASPP [14] uses dilated convolutions with larger dilation rates to cover larger receptive fields, and other studies [6,18] construct encoder-decoder structures that use multi-resolution features as multi-scale context. DANet [2] and OCNet [17] enhance the representation of each pixel by aggregating the representations of context pixels, where the context consists of all pixels; unlike a global context, these works consider self-attention-based schemes [27] that aggregate pixels weighted by their similarity, while a larger receptive field is still obtained through dilated convolution to fuse semantic information.
Models based on ordinary convolution include FCN [1], FPN [8] and UperNet [7]. FCN is the pioneering work among semantic segmentation models and fuses features across layers through up-sampling and concatenation of the pyramid feature maps; FPN fuses the layers through up-sampling and addition of features between the pyramid feature maps; UperNet aggregates features adaptively through a pyramid pooling module to improve the representation capability of the model.
Transformer-based segmentation models: the Transformer has completely changed natural language processing and has also been very successful in computer vision. ViT [26] is the first end-to-end vision Transformer for image classification; it converts the input image into a sequence of patches and attaches a class token. DeiT [18] introduces a teacher-student training strategy through distillation and improves the training efficiency of ViT. In addition to sequence-to-sequence model structures, PVT [19] and Swin Transformer [11] have drawn attention to the Vision Transformer. ViT has also been applied to downstream tasks and dense prediction, performing particularly well in semantic segmentation. SETR [21] uses ViT as the encoder and up-samples the output patch embeddings to classify pixels. Unlike SETR, Swin Transformer and ViT-Adapter [9] apply ideas from CNNs to the Transformer (the main body of the model is still a Transformer). Swin Transformer retains the pyramid structure of the feature maps output by a traditional convolutional encoder, and this retained pyramid structure can be combined with decoders of traditional neural networks, enabling Transformer-based downstream vision tasks. ViT-Adapter, as a fusion of convolutional neural networks and the ViT Transformer, closes the performance gap between plain ViT and vision-specific Transformers: it extracts multi-scale feature information through a Spatial Prior Module and two feature-interaction modules (Spatial Feature Injector and Multi-Scale Feature Extractor) without changing the ViT structure.
Among CNN-based semantic segmentation models, the dilated convolutions in dilation-based models enlarge the feature maps of the decoder, which in turn increases the computation of any subsequent attention mechanism of the model. In models based on ordinary convolution, because deep and shallow layers carry different feature information, simple successive up-sampling cannot fuse deep features with shallow features well; moreover, such fusion introduces no attention mechanism and lacks global feature information. The complexity of the UperNet model is tied to the number of feature channels of the encoder feature pyramid, which increases the computation and floating-point operations of the model.
Among Transformer-based semantic segmentation models, because CNNs and Transformers are two different model structures, many architectures based on dilated convolution cannot be used on Transformers. Because the ViT Transformer focuses on similarity between features, the lack of prior knowledge of spatial continuity reduces the representation capability of the model. Swin Transformer and ViT-Adapter do take the prior of spatial continuity into account, but the feature dimensions of their feature pyramids are too high, which increases the number of parameters and floating-point operations of the model.
References:
[1] J. Long, E. Shelhamer and T. Darrell, "Fully convolutional networks for semantic segmentation," 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3431-3440, doi: 10.1109/CVPR.2015.7298965.
[2] J. Fu et al., "Dual Attention Network for Scene Segmentation," 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 3141-3149, doi: 10.1109/CVPR.2019.00326.
[3] Chen, L., Papandreou, G., Kokkinos, I., Murphy, K. P., & Yuille, A. L. (2015). Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. CoRR, abs/1412.7062.
[4] Chen, L., Papandreou, G., Kokkinos, I., Murphy, K. P., & Yuille, A. L. (2018). DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40, 834-848.
[5] Chen, L., Papandreou, G., Schroff, F., & Adam, H. (2017). Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv, abs/1706.05587.
[6] Chen, L., Zhu, Y., Papandreou, G., Schroff, F., & Adam, H. (2018). Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. ECCV.
[7] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J. (2018). Unified Perceptual Parsing for Scene Understanding. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds) Computer Vision – ECCV 2018. Lecture Notes in Computer Science, vol 11209. Springer, Cham. https://doi.org/10.1007/978-3-030-01228-1_26
[8] Lin, T., Dollár, P., Girshick, R. B., He, K., Hariharan, B., & Belongie, S. J. (2017). Feature Pyramid Networks for Object Detection. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 936-944.
[9] Chen, Z., Duan, Y., Wang, W., He, J., Lu, T., Dai, J., & Qiao, Y. (2022). Vision Transformer Adapter for Dense Predictions. arXiv, abs/2205.08534.
[10] K. He, X. Zhang, S. Ren and J. Sun, "Deep Residual Learning for Image Recognition," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778, doi: 10.1109/CVPR.2016.90.
[11] Z. Liu et al., "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows," 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 9992-10002, doi: 10.1109/ICCV48922.2021.00986.
[12] M. Cordts et al., "The Cityscapes Dataset for Semantic Urban Scene Understanding," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 3213-3223, doi: 10.1109/CVPR.2016.350.
[13] H. Zhao, J. Shi, X. Qi, X. Wang and J. Jia, "Pyramid Scene Parsing Network," 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6230-6239, doi: 10.1109/CVPR.2017.660.
[14] M. Yang, K. Yu, C. Zhang, Z. Li and K. Yang, "DenseASPP for Semantic Segmentation in Street Scenes," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 3684-3692, doi: 10.1109/CVPR.2018.00388.
[15] J. He, Z. Deng, L. Zhou, Y. Wang and Y. Qiao, "Adaptive Pyramid Context Network for Semantic Segmentation," 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 7511-7520, doi: 10.1109/CVPR.2019.00770.
[16] Z. Zhu, M. Xu, S. Bai, T. Huang and X. Bai, "Asymmetric Non-Local Neural Networks for Semantic Segmentation," 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 593-602, doi: 10.1109/ICCV.2019.00068.
[17] Yuan, Y., & Wang, J. (2018). OCNet: Object Context Network for Scene Parsing. arXiv, abs/1809.00916.
[18] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H. (2021). Training data-efficient image transformers & distillation through attention. ICML.
[19] Wang, W., Xie, E., Li, X., Fan, D., Song, K., Liang, D., Lu, T., Luo, P., & Shao, L. (2021). Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 548-558.
[20] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P. H., & Zhang, L. (2021). Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 6877-6886.
[21] J. Fu, J. Liu, J. Jiang, Y. Li, Y. Bao and H. Lu, "Scene Segmentation With Dual Relation-Aware Attention Network," in IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 6, pp. 2547-2560, June 2021, doi: 10.1109/TNNLS.2020.3006524.
[22] Bousselham, W., Thibault, G., Pagano, L., Machireddy, A., Gray, J., & Chang, Y. H., et al. (2021). Efficient self-ensemble framework for semantic segmentation.
[23] Yuhui Yuan, Xiaokang Chen, Xilin Chen, and Jingdong Wang. Segmentation transformer: Object-contextual representations for semantic segmentation, 2021.
[24] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip H. S. Torr, and Li Zhang. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers, 2021.
[25] Cheng, B., Misra, I., Schwing, A. G., Kirillov, A., & Girdhar, R. (2021). Masked-attention Mask Transformer for Universal Image Segmentation. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 1280-1289.
[26] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv, abs/2010.11929.
[27] Lin, Zhouhan, Minwei Feng, Cícero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou and Yoshua Bengio. "A Structured Self-Attentive Sentence Embedding." arXiv, abs/1703.03130 (2017).
[28] Raghu, Maithra, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang and Alexey Dosovitskiy. "Do Vision Transformers See Like Convolutional Neural Networks?" Neural Information Processing Systems (2021).
Disclosure of Invention
Aiming at the defects of the prior art, the invention aims to provide a pyramid cross-layer fusion decoder based on semantic segmentation.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
An image semantic segmentation method based on a pyramid cross-layer fusion decoder comprises the following specific process:
S1, inputting an image;
S2, preprocessing the data;
S3, sending the image processed in step S2 into an encoder to generate an original feature pyramid F_1, F_2, F_3, F_4;
S4, sending the original feature pyramid obtained in step S3 into the pyramid cross-layer fusion decoder; in the pyramid cross-layer fusion decoder, the original feature pyramid passes through the enhanced context embedding (RCE) and the fusion module (Fusion Block) to generate the feature pyramid F_1*, F_2*, F_3*, F_4*; then F_1*, F_2*, F_3*, F_4* are sent to the FCFPN, which outputs the final semantic segmentation result, where the FCFPN is a UperNet decoder without the pooling pyramid module; the specific process is as follows:
S4.1, constructing spatial information: the spatial features inherent in the encoder itself, i.e. F_2, F_3 and F_4, are taken as the spatial information
F_sp ∈ R^((H_2·W_2 + H_3·W_3 + H_4·W_4) × D),
where H_i, W_i are the height and width of the feature map F_i, i = 2, 3, 4, and D is the embedding dimension, which is the same as the dimension of the context information formed by the enhanced context embedding RCE;
S4.2, generating context information:
the feature maps F_i, i = 2, 3, 4, output by the encoder are used directly; first, the channels are compressed by a convolution Conv with a kernel size of 1×1; next, F_2 is down-sampled and F_4 is up-sampled to the size of F_3, forming F'_2, F'_3, F'_4, which are passed through another 1×1 convolution Conv and a Flatten operation to form the context information
F_c ∈ R^((H_3·W_3) × D),
where D is the embedding dimension;
S4.3, the fusion module comprises three parts: an injector, an extractor and a cross-window attention module Swin Block; the injector and the extractor are the Spatial Feature Injector and the Multi-Scale Feature Extractor of the ViT-Adapter, the injector fuses the feature attention of the spatial information into the context information, the extractor feeds the feature attention of the context information back into the spatial information, and the cross-window attention module is used to realize a cross-window attention mechanism;
after passing through the fusion module, the spatial information and the context information yield F_2*, F_3*, F_4*;
S4.4, F_2*, F_3*, F_4* obtained in step S4.3 and F_1 are passed through the cross-layer fusion module CLGD to obtain F_1*;
S4.5, finally, F 1 * 、F 2 * 、F 3 * 、F 4 * Sending the semantic segmentation result to an UuperNet decoder without a pooling pyramid module, and outputting the final semantic segmentation result.
Further, the specific process of step S2 is as follows:
S2.1, normalization: the values of the three channels of the RGB image F_0 are normalized, i.e. F_0^n = F_0/255 = [F_R, F_G, F_B]/255, where F_R, F_G, F_B each have size 1024×2048;
S2.2, standardization: standardize along the R, G, B channel dimensions, i.e. F_0^ns = (F_0^n - mean)/std, where mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225].
The present invention also provides a computer readable storage medium having stored therein a computer program which when executed by a processor implements the above method.
The invention also provides a computer device comprising a processor and a memory for storing a computer program; the processor is configured to implement the above-described method when executing the computer program.
The invention has the following beneficial effects: in the method, the decoder structure is optimized; the RCE generates rich context information from the feature pyramid, improving the representation capability of the model; and the cross-layer fusion idea of the ViT-Adapter encoder is extended to the decoder through the Fusion Block, realizing interactive fusion of the context information and the spatial information and improving the semantic segmentation result.
Drawings
FIG. 1 is a schematic diagram of the overall architecture of a method implementation of embodiment 1 of the present invention;
FIG. 2 is a diagram showing the operation of the RCE in embodiment 1 of the present invention;
FIG. 3 is a schematic diagram of the structure of the ViT-Adapter encoder and the Fusion Block;
FIG. 4 is a schematic view of the feature map visualization in embodiment 2 of the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings. It should be noted that, although this embodiment provides a detailed implementation and a specific operation process based on the present technical solution, the protection scope of the present invention is not limited to this embodiment.
Example 1
This embodiment provides an image semantic segmentation method based on a pyramid cross-layer fusion decoder, as shown in FIG. 1. The specific process is as follows:
S1, inputting an image: the image is an RGB image F_0 whose digital representation has size 3×1024×2048, where 3 corresponds to the three channels R (red), G (green) and B (blue) and 1024×2048 is the height and width of the image;
S2, data preprocessing:
S2.1, normalization: to facilitate feature extraction by the model, the values of the RGB image F_0 on the three channels are normalized, i.e. F_0^n = F_0/255 = [F_R, F_G, F_B]/255, where F_R, F_G, F_B each have size 1024×2048;
S2.2, standardization: to make the model converge faster, this embodiment standardizes along the R, G, B channel dimensions, i.e. F_0^ns = (F_0^n - mean)/std, where mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225].
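As a concrete illustration of S2.1-S2.2, the following sketch applies the same normalization and standardization to an image tensor; the function name, the use of PyTorch and the random stand-in image are illustrative assumptions rather than part of the claimed method.

```python
import torch

# ImageNet channel statistics quoted in S2.2, shaped for broadcasting over (3, H, W).
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

def preprocess(image_uint8: torch.Tensor) -> torch.Tensor:
    """image_uint8: (3, 1024, 2048) RGB tensor with values in [0, 255]."""
    x = image_uint8.float() / 255.0            # S2.1: normalization to [0, 1]
    x = (x - IMAGENET_MEAN) / IMAGENET_STD     # S2.2: per-channel standardization
    return x

# Random stand-in image of the size given in S1.
img = torch.randint(0, 256, (3, 1024, 2048), dtype=torch.uint8)
print(preprocess(img).shape)  # torch.Size([3, 1024, 2048])
```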
S3, the image processed in step S2 is sent into an encoder to generate the original feature pyramid F_1, F_2, F_3, F_4. In this embodiment, Swin-Large is used as the encoder.
S4, the original feature pyramid obtained in step S3 is sent into the pyramid cross-layer fusion decoder; in the pyramid cross-layer fusion decoder, the original feature pyramid passes through the enhanced context embedding (RCE) and the fusion module (Fusion Block) to generate the feature pyramid F_1*, F_2*, F_3*, F_4*; then F_1*, F_2*, F_3*, F_4* are sent to the FCFPN, which outputs the final semantic segmentation result, where the FCFPN is a UperNet decoder without the pooling pyramid module (Pooling Pyramid Module, PPM).
Specifically, if the objects in the image scene are divided into N classes, the output of the model is O = [O_0, O_1, ..., O_{N-1}], where each O_i has size 1024×2048 and O_i gives, for every image position [h, w], the probability that the pixel belongs to the i-th class. For each position [h, w], the channel index with the maximum probability is selected as the class label of that pixel, and all pixels belonging to the same class label are marked with the same color, as sketched below.
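A minimal sketch of this labeling and coloring step is given below; the number of classes and the random palette are placeholders (the Cityscapes palette is not reproduced here), and PyTorch is assumed only for illustration.

```python
import torch

def colorize(logits: torch.Tensor, palette: torch.Tensor) -> torch.Tensor:
    """logits: (N, H, W) per-class scores; palette: (N, 3) uint8 colors per class."""
    labels = logits.argmax(dim=0)   # channel index of the maximum probability per pixel
    return palette[labels]          # (H, W, 3) color-coded segmentation map

num_classes = 19                    # e.g. the 19 evaluation classes of Cityscapes
logits = torch.randn(num_classes, 1024, 2048)
palette = torch.randint(0, 256, (num_classes, 3), dtype=torch.uint8)
print(colorize(logits, palette).shape)  # torch.Size([1024, 2048, 3])
```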
As shown in FIG. 1, the enhanced context embedding (RCE, Reinforce Context Embedding) is used to obtain the semantic information (F_c) of the feature maps; the fusion module (Fusion Block) is used to fuse the context information and the spatial information; CLGD [20] is the cross-layer fusion module for the semantic information F_c* and F_1; the FCFPN is the UperNet decoder without the pooling pyramid module PPM.
In this embodiment, the specific process of step S4 is as follows:
S4.1, constructing spatial information. The construction of the spatial information is similar to the Spatial Prior Module in the Vision Transformer Adapter (ViT-Adapter), except that the method of this embodiment takes the spatial features of the encoder itself (i.e. F_2, F_3 and F_4) as the spatial information
F_sp ∈ R^((H_2·W_2 + H_3·W_3 + H_4·W_4) × D),
where H_i, W_i are the height and width of the feature map F_i (i = 2, 3, 4) and D is the embedding dimension (Embedding Dim), which is the same as the dimension of the context information F_c formed by the enhanced context embedding RCE.
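A minimal sketch of this construction is given below: F_2, F_3 and F_4 are flattened and stacked into one token sequence of embedding dimension D. The 1×1 projection to D channels and the example channel and spatial sizes are assumptions for illustration; the text above does not specify how the channel counts are matched.

```python
import torch
import torch.nn as nn

class SpatialTokens(nn.Module):
    def __init__(self, in_channels, embed_dim):
        super().__init__()
        # assumed 1x1 projections bringing each level to the embedding dimension D
        self.proj = nn.ModuleList([nn.Conv2d(c, embed_dim, kernel_size=1) for c in in_channels])

    def forward(self, feats):
        # feats: [F2, F3, F4] with shapes (B, C_i, H_i, W_i)
        tokens = [p(f).flatten(2).transpose(1, 2) for p, f in zip(self.proj, feats)]
        return torch.cat(tokens, dim=1)   # (B, H2*W2 + H3*W3 + H4*W4, D)

# Placeholder channel and spatial sizes for F2, F3, F4.
f2 = torch.randn(1, 384, 128, 256)
f3 = torch.randn(1, 768, 64, 128)
f4 = torch.randn(1, 1536, 32, 64)
f_sp = SpatialTokens([384, 768, 1536], embed_dim=256)([f2, f3, f4])
print(f_sp.shape)  # torch.Size([1, 43008, 256])
```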
S4.2, generating context information.
Since the model size limitation of the UperNet decoder is limited by the number of feature channels of the feature pyramid of the encoder, the method of the embodiment performs feature compression on the features generated by the encoder through the embedding dimension.
Specifically, to enhance the characterizability of the model, the method of this embodiment compresses the original feature pyramid by RCE to form context information, as shown in fig. 2. Because the traditional convolutional neural network structure has the Patch Embedding (Patch Embedding) function, namely, the information extraction of the network under the same scale is realized through convolution and downsampling, the characteristic output of the convolutional neural network can be directly utilized as the multi-scale Patch Embedding. Furthermore, the number of channels is controlled by the embedding dimension.
More specifically, as shown in FIG. 2, the RCE forms the context information by directly using the feature maps F_i, i = 2, 3, 4, output by the encoder. First, the channels are compressed by a convolution Conv with a kernel size of 1×1; then, so that the features can be concatenated along the channel dimension, F_2 is down-sampled and F_4 is up-sampled to the size of F_3, forming F'_2, F'_3, F'_4, which are passed through another 1×1 convolution Conv and a Flatten operation to form the context information
F_c ∈ R^((H_3·W_3) × D),
where D is the embedding dimension.
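A sketch of the RCE as described above follows: per-level 1×1 compression, resizing F_2 and F_4 to the spatial size of F_3, channel concatenation, a second 1×1 convolution to the embedding dimension D, and flattening. The intermediate channel width and the example feature sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RCE(nn.Module):
    def __init__(self, in_channels, embed_dim, mid_channels=64):
        super().__init__()
        # one 1x1 compression convolution per input level (F2, F3, F4)
        self.compress = nn.ModuleList([nn.Conv2d(c, mid_channels, 1) for c in in_channels])
        self.fuse = nn.Conv2d(3 * mid_channels, embed_dim, 1)

    def forward(self, f2, f3, f4):
        target = f3.shape[-2:]                               # resize everything to F3's grid
        c2 = F.interpolate(self.compress[0](f2), size=target, mode="bilinear", align_corners=False)
        c3 = self.compress[1](f3)
        c4 = F.interpolate(self.compress[2](f4), size=target, mode="bilinear", align_corners=False)
        fused = self.fuse(torch.cat([c2, c3, c4], dim=1))    # (B, D, H3, W3)
        return fused.flatten(2).transpose(1, 2)              # (B, H3*W3, D) context tokens F_c

rce = RCE(in_channels=[384, 768, 1536], embed_dim=256)
f2, f3, f4 = torch.randn(1, 384, 128, 256), torch.randn(1, 768, 64, 128), torch.randn(1, 1536, 32, 64)
print(rce(f2, f3, f4).shape)  # torch.Size([1, 8192, 256])
```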
S4.3, the context information F has been obtained by steps S4.1 and S4.2 c And spatial information F sp Next, interaction of the two information is to be achieved. In this embodiment, the design concept of the Fusion Block derives from the Cross attribute concept in the ViT Adapter, unlike the ViT Adapter, which is shown in fig. 3 (a), the ViT Adapter uses ViT as a main body, but multiple ViT blocks using ViT as a main body can cause an increase in computational complexity of the model, so that in order to reduce the complexity of the model, the method in this embodiment, when designing the Fusion Block of the decoder, the correction of semantic information is realized only by one ViT Block, and in order to realize the Attention of a Cross window, the ViT Block is replaced by the Swin Block, so that the problem that the Attention mechanism is difficult to be introduced on a single scale due to the size problem between layers of the feature pyramid is solved.
Specifically, the Fusion Block has a structure as shown in fig. 3 (b), and comprises three parts, namely an Injector, an Extractor and a cross-window attention module Swin Block, wherein the Injector and the Extractor are a spatial feature Injector Spatial Feature Injector and a Multi-scale feature Extractor Multi-Scale Feature Extractor in a ViT Adapter, the Injector fuses the feature attention of spatial information into context information, and the Extractor imparts the feature attention of the context information into the spatial information, so that deep features of a feature pyramid can better act on shallow features; the cross-window attention module Swin Block is used for realizing a cross-window attention mechanism.
The spatial information and the context information are generated to obtain F after Fusion Block 2 * 、F 3 * 、F 4 *
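A structural sketch of the Fusion Block is given below: the Injector lets the context tokens attend to the spatial tokens, a block then processes the context, and the Extractor lets the spatial tokens attend back to the context. For brevity the shifted-window Swin Block is stood in for by plain multi-head attention, and the normalization layout, the absence of FFN/gating and the toy token counts are simplifying assumptions, not the actual design.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, dim, heads):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query, key_value):
        kv = self.norm_kv(key_value)
        out, _ = self.attn(self.norm_q(query), kv, kv)
        return query + out                                # residual connection

class FusionBlock(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.injector = CrossAttention(dim, heads)        # spatial attention -> context
        self.context_block = CrossAttention(dim, heads)   # stand-in for the Swin Block
        self.extractor = CrossAttention(dim, heads)       # context attention -> spatial

    def forward(self, f_sp, f_c):
        f_c = self.injector(f_c, f_sp)                    # inject spatial information into the context
        f_c = self.context_block(f_c, f_c)                # self-attention (shifted windows in the patent)
        f_sp = self.extractor(f_sp, f_c)                  # hand the context back to the spatial tokens
        return f_sp, f_c

# Toy token counts (kept small so the example runs cheaply).
f_sp, f_c = torch.randn(1, 672, 256), torch.randn(1, 128, 256)
new_sp, new_c = FusionBlock()(f_sp, f_c)
print(new_sp.shape, new_c.shape)  # torch.Size([1, 672, 256]) torch.Size([1, 128, 256])
```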
S4.4, F_2*, F_3*, F_4* obtained in step S4.3 and F_1 are passed through the cross-layer fusion module CLGD to obtain F_1*.
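The internal structure of the CLGD is taken from the cited literature and is not detailed in this text. Purely as an assumed placeholder, the sketch below fuses the up-sampled deep feature into F_1 through a learned gate; it illustrates only the role of the step (deep-to-shallow cross-layer fusion), not the actual CLGD design, and assumes both inputs already have the embedding dimension.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedCrossLayerFusion(nn.Module):
    """Assumed stand-in for the CLGD step: gate the up-sampled deep feature into F1."""
    def __init__(self, dim=256):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(2 * dim, dim, 1), nn.Sigmoid())
        self.out = nn.Conv2d(dim, dim, 3, padding=1)

    def forward(self, f1, f2_star):
        deep = F.interpolate(f2_star, size=f1.shape[-2:], mode="bilinear", align_corners=False)
        g = self.gate(torch.cat([f1, deep], dim=1))   # per-pixel gate computed from both levels
        return self.out(f1 + g * deep)                # gated residual fusion -> F1*

f1, f2_star = torch.randn(1, 256, 128, 256), torch.randn(1, 256, 64, 128)
print(GatedCrossLayerFusion()(f1, f2_star).shape)  # torch.Size([1, 256, 128, 256])
```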
S4.5, finally, F 1 * 、F 2 * 、F 3 * 、F 4 * Sending the semantic segmentation result to an UuperNet decoder without a pooling pyramid module, and outputting the final semantic segmentation result.
Example 2
To demonstrate the effectiveness of the method of Example 1, the following ablation experiments were performed on the Cityscapes dataset [12]:
1. Evaluation criteria of the model
In this embodiment, the model is evaluated by three criteria: mean intersection over union (Mean IoU), floating-point operations (FLOPs), and number of parameters (Param).
(1) Mean IoU:
TP: the number of samples that are actually positive and predicted as positive by the model;
TN: the number of samples that are actually negative and predicted as negative;
FP: the number of samples that are actually negative but predicted as positive;
FN: the number of samples that are actually positive but predicted as negative. Therefore
IoU = TP / (TP + FP + FN),
and Mean IoU is the average of the IoU over all categories in the dataset. The larger the value, the higher the accuracy of the model.
(2) FLOPs: the number of floating-point operations, which can be understood as the amount of computation, used to measure the computational complexity of the model. The larger the value, the larger the computation of the model.
(3) Param: the number of learnable parameters of the model, used to measure the size of the model. The larger the value, the more memory the model occupies.
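As an illustration of the Mean IoU criterion, the sketch below accumulates a confusion matrix from a predicted and a ground-truth label map and averages IoU = TP / (TP + FP + FN) over the classes; ignore-label handling and multi-image accumulation are omitted for brevity, and PyTorch is assumed only for illustration.

```python
import torch

def mean_iou(pred: torch.Tensor, target: torch.Tensor, num_classes: int) -> float:
    """pred, target: (H, W) integer label maps with values in [0, num_classes)."""
    idx = target.flatten() * num_classes + pred.flatten()
    conf = torch.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = conf.diag().float()
    fp = conf.sum(dim=0).float() - tp          # predicted as the class but actually another class
    fn = conf.sum(dim=1).float() - tp          # actually the class but predicted as another class
    iou = tp / (tp + fp + fn).clamp(min=1)     # avoid division by zero for absent classes
    return iou.mean().item()

pred = torch.randint(0, 19, (1024, 2048))
target = torch.randint(0, 19, (1024, 2048))
print(mean_iou(pred, target, num_classes=19))
```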
2. Ablation experiments of RCE
The method of Example 1 obtains enriched semantic information through the RCE. To demonstrate the role of the RCE, this example designed fusion strategies over different layers, as shown in Table 1.
TABLE 1
[Table 1 is reproduced as an image in the original publication.]
Table 1 shows the results of the RCE ablation experiment on the Cityscapes validation set, where the RCE column indicates the backbone output layers F_i involved in the RCE.
As shown in Table 1, the RCE significantly improves the segmentation accuracy. Compared with the FCFPN, taking F_2, F_3 and F_4 as the input features of the RCE ultimately improves the accuracy by 1.97% (Res50) and 1.91% (Res101). It can be seen that, without significantly increasing the computation and the number of parameters of the model, F_2, F_3 and F_4 are an ideal choice for the embedding layer. For semantic segmentation, context information is the key to the representation of feature map information, and feature maps at different scales attend to different context information, which improves the segmentation accuracy of the model. From here on, the RCE takes F_2, F_3 and F_4 as the embedding layer by default.
3. Ablation experiments of Fusion Block (FB)
TABLE 2
[Table 2 is reproduced as an image in the original publication.]
Table 2 shows the results of the Fusion Block (FB) and RCE ablation experiments on the Cityscapes dataset. Because of the nature of the FB itself, i.e. the two-dimensional feature map must be serialized by a patch embedding, this embodiment selects the deepest feature F_4 as the input of the patch embedding.
As shown in Table 2, this embodiment still uses the FCFPN as the baseline. First, to determine the influence of FB and RCE on model accuracy, the corresponding experiments were performed separately: with the FB alone, there is an improvement of 5.71% and 4.85% over the FCFPN when the backbone is Res50 and Res101, respectively, while using FB and RCE together brings improvements of 6.09% and 5.24%, respectively, which illustrates the influence of FB and RCE on model accuracy.
4. Hyper-parameter settings of the model
Because the ViT-Adapter sets different embedding dims and numbers of heads to design encoders of different scales, this embodiment likewise provides decoder settings in five different modes to investigate the performance of the pyramid cross-layer fusion decoder (hereinafter PCFD) under different modes. As shown in Table 3 below, the PCFD is divided into five modes, including tiny, small, base and large; in the preceding ablation experiments, the setting adopted by default in this example is the tiny mode.
TABLE 3
[Table 3 is reproduced as an image in the original publication.]
Table 3 shows the hyper-parameter settings of the PCFD; Embedding Dim denotes the feature dimension (number of feature channels) of the context information and the spatial information, and Head (Space, Context, Swin) denotes the number of attention heads in the Injector, the Extractor and the cross-window attention module Swin Block.
TABLE 4
[Table 4 is reproduced as an image in the original publication.]
Table 4 shows the experimental results of the different modes on the Cityscapes dataset, where the marked entries indicate that the model adopts the OHEM [17] strategy during training.
From the experimental results, increasing the PCFD mode does not bring an obvious improvement in model accuracy. Therefore only the tiny mode is selected as the default model hyper-parameter configuration in the subsequent experiments.
Example 3
To demonstrate the advantages of the PCFD, the following comparative experiments were performed in this example, again on the Cityscapes dataset.
TABLE 5
[Table 5 is reproduced as an image in the original publication.]
Table 5 shows a comparison on the Cityscapes dataset, where FLOPs are the floating-point operations of the model at the same input size; the marks denote, respectively, that the model structure is fine-tuned by training on another dataset and that the model adopts OHEM [26] during training; # denotes that the crop_size of the model during training is 896×896; OM denotes Out of Memory.
Table 5 gives the results of the most advanced methods on Cityscapes, divided into two groups: the first group uses a CNN as the backbone and the second group uses a Transformer as the backbone. On this dataset, when the standard ResNet is chosen as the representative of CNNs, the PCFD is superior to the other methods in terms of model parameters, floating-point operations and model accuracy. When Swin-L and ViT-Adapter-L are used as backbones, the PCFD is 0.8% below the best model accuracy (mIoU), but the parameter count and floating-point operations of the PCFD are reduced by 23% and 68%, respectively, compared with ViT-Adapter-UperNet.
Feature visualization: because different encoders have different inherent characteristics, the encoders are divided into three classes, namely CNN, Swin Transformer and ViT-Adapter, and the PCFD plays a slightly different role in each of the three.
As shown in FIG. 4, this embodiment visualizes, for the three types of encoders ResNet, Swin Transformer and ViT-Adapter, the feature maps F_i and the feature maps F_i* obtained through the RCE and the Fusion Block, where i ∈ [1, 4], taking ResNet101, Swin-Large and ViT-Adapter as examples. It can be seen from the figure that the feature map visualizations of ResNet101 and Swin-Large differ greatly; the root cause of this has been discussed in recent work [28].
For both, however, after the feature fusion of F_1, the edges of the segmented objects and the texture inside the object contours can be clearly mapped.
In R101, since spatially local features are the focus of the convolutional network, the PCFD achieves compression of the feature channels, introduction of the attention mechanism, and fusion of deep features with shallow features.
In the Swin Transformer, since the Transformer focuses on the similarity of global features, after the PCFD the semantic distinction of the shallow features is enhanced by the cross-layer fusion of the context information obtained through the RCE with the spatial information. The effect of the PCFD in the Swin Transformer is therefore the compression of the feature channels and the fusion of deep features with shallow features.
In the ViT-Adapter, because of the characteristics of the ViT-Adapter itself (the width and height of the input image must be equal), this embodiment selects a portion of the original image as the model input and scales the feature maps to the same size as the R101-PCFD and Swin-PCFD feature maps. From the visualized feature maps, the feature distinction between different classes increases after the PCFD, the features are not lost significantly after feature compression, and the shallow feature F_2 also fuses deep context information.
Various modifications and variations of the present invention will be apparent to those skilled in the art in light of the foregoing teachings and are intended to be included within the scope of the following claims.

Claims (4)

1. An image semantic segmentation method based on a pyramid cross-layer fusion decoder, characterized by comprising the following specific steps:
S1, inputting an image;
S2, preprocessing the data;
S3, sending the image processed in step S2 into an encoder to generate an original feature pyramid F_1, F_2, F_3, F_4;
S4, sending the original feature pyramid obtained in step S3 into the pyramid cross-layer fusion decoder; in the pyramid cross-layer fusion decoder, the original feature pyramid passes through the enhanced context embedding (RCE) and the fusion module (Fusion Block) to generate the feature pyramid F_1*, F_2*, F_3*, F_4*; then F_1*, F_2*, F_3*, F_4* are sent to the FCFPN, which outputs the final semantic segmentation result, where the FCFPN is a UperNet decoder without the pooling pyramid module; the specific process is as follows:
S4.1, constructing spatial information: the spatial features inherent in the encoder itself, i.e. F_2, F_3 and F_4, are taken as the spatial information
F_sp ∈ R^((H_2·W_2 + H_3·W_3 + H_4·W_4) × D),
where H_i, W_i are the height and width of the feature map F_i, i = 2, 3, 4, and D is the embedding dimension, which is the same as the dimension of the context information formed by the enhanced context embedding RCE;
S4.2, generating context information:
the feature maps F_i, i = 2, 3, 4, output by the encoder are used directly; first, the channels are compressed by a convolution Conv with a kernel size of 1×1; next, F_2 is down-sampled and F_4 is up-sampled to the size of F_3, forming F'_2, F'_3, F'_4, which are passed through another 1×1 convolution Conv and a Flatten operation to form the context information
F_c ∈ R^((H_3·W_3) × D),
where D is the embedding dimension;
S4.3, the fusion module comprises three parts: an injector, an extractor and a cross-window attention module Swin Block; the injector and the extractor are the Spatial Feature Injector and the Multi-Scale Feature Extractor of the ViT-Adapter, the injector fuses the feature attention of the spatial information into the context information, the extractor feeds the feature attention of the context information back into the spatial information, and the cross-window attention module is used to realize a cross-window attention mechanism;
after passing through the fusion module, the spatial information and the context information yield F_2*, F_3*, F_4*;
S4.4, F_2*, F_3*, F_4* obtained in step S4.3 and F_1 are passed through the cross-layer fusion module CLGD to obtain F_1*;
S4.5, finally, F 1 * 、F 2 * 、F 3 * 、F 4 * Sending the semantic segmentation result to an UuperNet decoder without a pooling pyramid module, and outputting the final semantic segmentation result.
2. The method according to claim 1, wherein the specific process of step S2 is:
S2.1, normalization: the values of the three channels of the RGB image F_0 are normalized, i.e. F_0^n = F_0/255 = [F_R, F_G, F_B]/255, where F_R, F_G, F_B each have size 1024×2048;
S2.2, standardization: standardize along the R, G, B channel dimensions, i.e. F_0^ns = (F_0^n - mean)/std, where mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225].
3. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program which, when executed by a processor, implements the method of any of claims 1-2.
4. A computer device comprising a processor and a memory, the memory for storing a computer program; the processor being adapted to implement the method of any of claims 1-2 when the computer program is executed.
CN202310169764.2A 2023-02-27 2023-02-27 Pyramid cross-layer fusion decoder based on semantic segmentation Pending CN116310324A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310169764.2A CN116310324A (en) 2023-02-27 2023-02-27 Pyramid cross-layer fusion decoder based on semantic segmentation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310169764.2A CN116310324A (en) 2023-02-27 2023-02-27 Pyramid cross-layer fusion decoder based on semantic segmentation

Publications (1)

Publication Number Publication Date
CN116310324A true CN116310324A (en) 2023-06-23

Family

ID=86823329

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310169764.2A Pending CN116310324A (en) 2023-02-27 2023-02-27 Pyramid cross-layer fusion decoder based on semantic segmentation

Country Status (1)

Country Link
CN (1) CN116310324A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117152441A (en) * 2023-10-19 2023-12-01 中国科学院空间应用工程与技术中心 Biological image instance segmentation method based on cross-scale decoding
CN117152441B (en) * 2023-10-19 2024-05-07 中国科学院空间应用工程与技术中心 Biological image instance segmentation method based on cross-scale decoding

Similar Documents

Publication Publication Date Title
Hafiz et al. A survey on instance segmentation: state of the art
Sun et al. Deep RGB-D saliency detection with depth-sensitive attention and automatic multi-modal fusion
Zhou et al. Specificity-preserving RGB-D saliency detection
Hu et al. SAC-Net: Spatial attenuation context for salient object detection
Ju et al. A simple and efficient network for small target detection
CN111062395B (en) Real-time video semantic segmentation method
CN112634296A (en) RGB-D image semantic segmentation method and terminal for guiding edge information distillation through door mechanism
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
CN113052755A (en) High-resolution image intelligent matting method based on deep learning
CN110363068A (en) A kind of high-resolution pedestrian image generation method based on multiple dimensioned circulation production confrontation network
Xu et al. RGB-T salient object detection via CNN feature and result saliency map fusion
CN116310324A (en) Pyramid cross-layer fusion decoder based on semantic segmentation
Yu et al. WaterHRNet: A multibranch hierarchical attentive network for water body extraction with remote sensing images
Zheng et al. Feature pyramid of bi-directional stepped concatenation for small object detection
CN113705575B (en) Image segmentation method, device, equipment and storage medium
Gao et al. MLTDNet: an efficient multi-level transformer network for single image deraining
Cheng et al. A survey on image semantic segmentation using deep learning techniques
Zhang et al. Underwater target detection algorithm based on improved YOLOv4 with SemiDSConv and FIoU loss function
Wang et al. Multi-scale dense and attention mechanism for image semantic segmentation based on improved DeepLabv3+
Huang et al. DeeptransMap: a considerably deep transmission estimation network for single image dehazing
Qiao et al. Depth super-resolution from explicit and implicit high-frequency features
Zou et al. Diffcr: A fast conditional diffusion framework for cloud removal from optical satellite images
Zhu et al. HDRD-Net: High-resolution detail-recovering image deraining network
Wang et al. A Novel Neural Network Based on Transformer for Polyp Image Segmentation
Zhang et al. Se-dcgan: a new method of semantic image restoration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination