CN116310324A - Pyramid cross-layer fusion decoder based on semantic segmentation - Google Patents

Pyramid cross-layer fusion decoder based on semantic segmentation

Info

Publication number
CN116310324A
Authority
CN
China
Prior art keywords
pyramid
feature
cross
fusion
decoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310169764.2A
Other languages
Chinese (zh)
Inventor
张颂扬
任歌
张亮
林鸿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Institute Of Advanced Measurement Technology
Original Assignee
Zhengzhou Institute Of Advanced Measurement Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Institute Of Advanced Measurement Technology filed Critical Zhengzhou Institute Of Advanced Measurement Technology
Priority to CN202310169764.2A priority Critical patent/CN116310324A/en
Publication of CN116310324A publication Critical patent/CN116310324A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks


Abstract

The invention discloses an image semantic segmentation method based on a pyramid cross-layer fusion decoder. The decoder structure is optimized: an enhanced context embedding (RCE) generates rich context information from the feature pyramid, improving the representation capability of the model, and the cross-layer fusion idea of the ViT-Adapter encoder is extended to the decoder through a Fusion Block, realizing interactive fusion of the context information and the spatial information and improving the semantic segmentation result.

Description

Pyramid cross-layer fusion decoder based on semantic segmentation
Technical Field
The invention relates to the technical field of image processing, in particular to a pyramid cross-layer fusion decoder based on semantic segmentation.
Background
Semantic segmentation classifies every pixel of an input image according to the semantic categories of everyday objects and assigns each pixel the color of its predicted class, i.e., it colors the image. Because pixels of the same class receive the same color, the classes appear to be separated (segmented) from the input image, hence the name semantic segmentation. Generating the segmented image is the task of a semantic segmentation model, which has an encoder-decoder architecture: the encoder learns feature representations, and the decoder performs pixel-level classification on the feature representations produced by the encoder. Existing semantic segmentation models can be divided into two classes: CNN-based and Transformer-based semantic segmentation models.
Semantic segmentation models based on CNN: according to the kind of convolution used, CNN-based segmentation models can be divided into two classes: models based on dilated convolution and models based on ordinary convolution.
Models based on dilated convolution include the following. PSPNet [13] applies conventional convolutions on pyramid levels to capture multi-scale semantic information. The DeepLab series [3-6] adopts parallel dilated convolutions with different dilation rates, where different dilation rates capture context information at different scales. Recent work [17-20] proposes various dilation-based decoders: DenseASPP [14] uses dilated convolutions with larger dilation rates to cover larger receptive fields, and other studies [6,18] construct encoder-decoder structures that use multi-resolution features as multi-scale context. DANet [2] and OCNet [17] enhance the representation of each pixel by aggregating the representations of context pixels, where the context consists of all pixels; unlike a global context, these works consider self-attention-based schemes [27] that aggregate pixels weighted by their similarity, while a larger receptive field is still obtained through dilated convolution to fuse semantic information.
Models based on ordinary convolution include FCN [1], FPN [8] and UperNet [7]. FCN is the pioneering work among semantic segmentation models and fuses features across layers through up-sampling and concatenation of the pyramid feature maps; FPN fuses the layers through up-sampling and addition of features between the pyramid feature maps; UperNet aggregates features adaptively through a pyramid pooling module to improve the representation capability of the model.
Transformer-based segmentation models: the Transformer has completely changed natural language processing and has also been very successful in computer vision. ViT [26] is the first end-to-end vision Transformer for image classification; it converts the input image into a sequence of patches and attaches a class token. DeiT [18] introduces a teacher-student training strategy through distillation and improves the training efficiency of ViT. In addition to sequence-to-sequence model structures, PVT [19] and Swin Transformer [11] have drawn attention to the Vision Transformer. ViT has also been applied to downstream tasks and dense prediction, performing particularly well in semantic segmentation. SETR [21] uses ViT as the encoder and up-samples the output patch embeddings to classify pixels. Unlike SETR, Swin Transformer and ViT-Adapter [9] apply ideas from CNNs to the Transformer (the main body of the model is still a Transformer). Swin Transformer retains the pyramid structure of the feature maps output by a traditional convolutional encoder, and this retained pyramid structure can be combined with decoders of traditional neural networks, enabling Transformer-based downstream vision tasks. ViT-Adapter, as a fusion of convolutional neural networks and the ViT Transformer, closes the performance gap between plain ViT and vision-specific Transformers: it extracts multi-scale feature information through a Spatial Prior Module and two feature-interaction modules (Spatial Feature Injector and Multi-Scale Feature Extractor) without changing the ViT structure.
Among CNN-based semantic segmentation models, the dilated convolutions in dilation-based models enlarge the feature maps of the decoder, which in turn increases the computation of any subsequent attention mechanism of the model. In models based on ordinary convolution, because deep and shallow layers carry different feature information, simple successive up-sampling cannot fuse deep features with shallow features well; moreover, such fusion introduces no attention mechanism and lacks global feature information. The complexity of the UperNet model is tied to the number of feature channels of the encoder feature pyramid, which increases the computation and floating-point operations of the model.
Among Transformer-based semantic segmentation models, because CNNs and Transformers are two different model structures, many architectures based on dilated convolution cannot be used on Transformers. Because the ViT Transformer focuses on similarity between features, the lack of prior knowledge of spatial continuity reduces the representation capability of the model. Swin Transformer and ViT-Adapter do take the prior of spatial continuity into account, but the feature dimensions of their feature pyramids are too high, which increases the number of parameters and floating-point operations of the model.
References:
[1] J. Long, E. Shelhamer and T. Darrell, "Fully convolutional networks for semantic segmentation," 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3431-3440, doi: 10.1109/CVPR.2015.7298965.
[2] J. Fu et al., "Dual Attention Network for Scene Segmentation," 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 3141-3149, doi: 10.1109/CVPR.2019.00326.
[3] Chen, L., Papandreou, G., Kokkinos, I., Murphy, K. P., & Yuille, A. L. (2015). Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. CoRR, abs/1412.7062.
[4] Chen, L., Papandreou, G., Kokkinos, I., Murphy, K. P., & Yuille, A. L. (2018). DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40, 834-848.
[5] Chen, L., Papandreou, G., Schroff, F., & Adam, H. (2017). Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv, abs/1706.05587.
[6] Chen, L., Zhu, Y., Papandreou, G., Schroff, F., & Adam, H. (2018). Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. ECCV.
[7] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J. (2018). Unified Perceptual Parsing for Scene Understanding. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds) Computer Vision – ECCV 2018. Lecture Notes in Computer Science, vol 11209. Springer, Cham. https://doi.org/10.1007/978-3-030-01228-1_26
[8] Lin, T., Dollár, P., Girshick, R. B., He, K., Hariharan, B., & Belongie, S. J. (2017). Feature Pyramid Networks for Object Detection. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 936-944.
[9] Chen, Z., Duan, Y., Wang, W., He, J., Lu, T., Dai, J., & Qiao, Y. (2022). Vision Transformer Adapter for Dense Predictions. arXiv, abs/2205.08534.
[10] K. He, X. Zhang, S. Ren and J. Sun, "Deep Residual Learning for Image Recognition," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778, doi: 10.1109/CVPR.2016.90.
[11] Z. Liu et al., "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows," 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 9992-10002, doi: 10.1109/ICCV48922.2021.00986.
[12] M. Cordts et al., "The Cityscapes Dataset for Semantic Urban Scene Understanding," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 3213-3223, doi: 10.1109/CVPR.2016.350.
[13] H. Zhao, J. Shi, X. Qi, X. Wang and J. Jia, "Pyramid Scene Parsing Network," 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6230-6239, doi: 10.1109/CVPR.2017.660.
[14] M. Yang, K. Yu, C. Zhang, Z. Li and K. Yang, "DenseASPP for Semantic Segmentation in Street Scenes," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 3684-3692, doi: 10.1109/CVPR.2018.00388.
[15] J. He, Z. Deng, L. Zhou, Y. Wang and Y. Qiao, "Adaptive Pyramid Context Network for Semantic Segmentation," 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 7511-7520, doi: 10.1109/CVPR.2019.00770.
[16] Z. Zhu, M. Xu, S. Bai, T. Huang and X. Bai, "Asymmetric Non-Local Neural Networks for Semantic Segmentation," 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 593-602, doi: 10.1109/ICCV.2019.00068.
[17] Yuan, Y., & Wang, J. (2018). OCNet: Object Context Network for Scene Parsing. arXiv, abs/1809.00916.
[18] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H. (2021). Training data-efficient image transformers & distillation through attention. ICML.
[19] Wang, W., Xie, E., Li, X., Fan, D., Song, K., Liang, D., Lu, T., Luo, P., & Shao, L. (2021). Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 548-558.
[20] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P. H., & Zhang, L. (2021). Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 6877-6886.
[21] J. Fu, J. Liu, J. Jiang, Y. Li, Y. Bao and H. Lu, "Scene Segmentation With Dual Relation-Aware Attention Network," in IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 6, pp. 2547-2560, June 2021, doi: 10.1109/TNNLS.2020.3006524.
[22] Bousselham, W., Thibault, G., Pagano, L., Machireddy, A., Gray, J., & Chang, Y. H., et al. (2021). Efficient self-ensemble framework for semantic segmentation.
[23] Yuhui Yuan, Xiaokang Chen, Xilin Chen, and Jingdong Wang. Segmentation transformer: Object-contextual representations for semantic segmentation, 2021.
[24] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip H. S. Torr, and Li Zhang. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers, 2021.
[25] Cheng, B., Misra, I., Schwing, A. G., Kirillov, A., & Girdhar, R. (2021). Masked-attention Mask Transformer for Universal Image Segmentation. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 1280-1289.
[26] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv, abs/2010.11929.
[27] Lin, Zhouhan, Minwei Feng, Cícero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou and Yoshua Bengio. "A Structured Self-Attentive Sentence Embedding." arXiv, abs/1703.03130 (2017).
[28] Raghu, Maithra, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang and Alexey Dosovitskiy. "Do Vision Transformers See Like Convolutional Neural Networks?" Neural Information Processing Systems (2021).
Disclosure of Invention
Aiming at the defects of the prior art, the invention aims to provide a pyramid cross-layer fusion decoder based on semantic segmentation.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
An image semantic segmentation method based on a pyramid cross-layer fusion decoder comprises the following specific process:
S1, inputting an image;
S2, preprocessing the data;
S3, sending the image processed in step S2 into an encoder to generate an original feature pyramid F_1, F_2, F_3, F_4;
S4, sending the original feature pyramid obtained in step S3 into the pyramid cross-layer fusion decoder; in the pyramid cross-layer fusion decoder, the original feature pyramid passes through the enhanced context embedding (RCE) and the fusion module (Fusion Block) to generate the feature pyramid F_1*, F_2*, F_3*, F_4*; then F_1*, F_2*, F_3*, F_4* are sent to the FCFPN, which outputs the final semantic segmentation result, where the FCFPN is a UperNet decoder without the pooling pyramid module; the specific process is as follows:
S4.1, constructing spatial information: the spatial features inherent in the encoder itself, i.e. F_2, F_3 and F_4, are taken as the spatial information
F_sp ∈ R^((H_2·W_2 + H_3·W_3 + H_4·W_4) × D),
where H_i, W_i are the height and width of the feature map F_i, i = 2, 3, 4, and D is the embedding dimension, which is the same as the dimension of the context information formed by the enhanced context embedding RCE;
S4.2, generating context information:
the feature maps F_i, i = 2, 3, 4, output by the encoder are used directly; first, the channels are compressed by a convolution Conv with a kernel size of 1×1; next, F_2 is down-sampled and F_4 is up-sampled to the size of F_3, forming F'_2, F'_3, F'_4, which are passed through another 1×1 convolution Conv and a Flatten operation to form the context information
F_c ∈ R^((H_3·W_3) × D),
where D is the embedding dimension;
S4.3, the fusion module comprises three parts: an injector, an extractor and a cross-window attention module Swin Block; the injector and the extractor are the Spatial Feature Injector and the Multi-Scale Feature Extractor of the ViT-Adapter, the injector fuses the feature attention of the spatial information into the context information, the extractor feeds the feature attention of the context information back into the spatial information, and the cross-window attention module is used to realize a cross-window attention mechanism;
after passing through the fusion module, the spatial information and the context information yield F_2*, F_3*, F_4*;
S4.4, F_2*, F_3*, F_4* obtained in step S4.3 and F_1 are passed through the cross-layer fusion module CLGD to obtain F_1*;
S4.5, finally, F 1 * 、F 2 * 、F 3 * 、F 4 * Sending the semantic segmentation result to an UuperNet decoder without a pooling pyramid module, and outputting the final semantic segmentation result.
Further, the specific process of step S2 is as follows:
S2.1, normalization: the values of the three channels of the RGB image F_0 are normalized, i.e. F_0^n = F_0/255 = [F_R, F_G, F_B]/255, where F_R, F_G, F_B each have size 1024×2048;
S2.2, standardization: standardize along the R, G, B channel dimensions, i.e. F_0^ns = (F_0^n - mean)/std, where mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225].
The present invention also provides a computer readable storage medium having stored therein a computer program which when executed by a processor implements the above method.
The invention also provides a computer device comprising a processor and a memory for storing a computer program; the processor is configured to implement the above-described method when executing the computer program.
The invention has the following beneficial effects: in the method, the decoder structure is optimized; the RCE generates rich context information from the feature pyramid, improving the representation capability of the model; and the cross-layer fusion idea of the ViT-Adapter encoder is extended to the decoder through the Fusion Block, realizing interactive fusion of the context information and the spatial information and improving the semantic segmentation result.
Drawings
FIG. 1 is a schematic diagram of the overall architecture of a method implementation of embodiment 1 of the present invention;
FIG. 2 is a diagram showing the operation of the RCE in embodiment 1 of the present invention;
FIG. 3 is a schematic diagram of the structure of the ViT-Adapter encoder and the Fusion Block;
FIG. 4 is a schematic view of the feature map visualization in embodiment 2 of the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings. It should be noted that, although this embodiment provides a detailed implementation and a specific operation process based on the present technical solution, the protection scope of the present invention is not limited to this embodiment.
Example 1
This embodiment provides an image semantic segmentation method based on a pyramid cross-layer fusion decoder, as shown in FIG. 1. The specific process is as follows:
S1, inputting an image: the image is an RGB image F_0 whose digital representation has size 3×1024×2048, where 3 corresponds to the three channels R (red), G (green) and B (blue) and 1024×2048 is the height and width of the image;
S2, data preprocessing:
S2.1, normalization: to facilitate feature extraction by the model, the values of the RGB image F_0 on the three channels are normalized, i.e. F_0^n = F_0/255 = [F_R, F_G, F_B]/255, where F_R, F_G, F_B each have size 1024×2048;
S2.2, standardization: to make the model converge faster, this embodiment standardizes along the R, G, B channel dimensions, i.e. F_0^ns = (F_0^n - mean)/std, where mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225].
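As a concrete illustration of S2.1-S2.2, the following sketch applies the same normalization and standardization to an image tensor; the function name, the use of PyTorch and the random stand-in image are illustrative assumptions rather than part of the claimed method.

```python
import torch

# ImageNet channel statistics quoted in S2.2, shaped for broadcasting over (3, H, W).
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

def preprocess(image_uint8: torch.Tensor) -> torch.Tensor:
    """image_uint8: (3, 1024, 2048) RGB tensor with values in [0, 255]."""
    x = image_uint8.float() / 255.0            # S2.1: normalization to [0, 1]
    x = (x - IMAGENET_MEAN) / IMAGENET_STD     # S2.2: per-channel standardization
    return x

# Random stand-in image of the size given in S1.
img = torch.randint(0, 256, (3, 1024, 2048), dtype=torch.uint8)
print(preprocess(img).shape)  # torch.Size([3, 1024, 2048])
```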
S3, the image processed in step S2 is sent into an encoder to generate the original feature pyramid F_1, F_2, F_3, F_4. In this embodiment, Swin-Large is used as the encoder.
S4, the original feature pyramid obtained in step S3 is sent into the pyramid cross-layer fusion decoder; in the pyramid cross-layer fusion decoder, the original feature pyramid passes through the enhanced context embedding (RCE) and the fusion module (Fusion Block) to generate the feature pyramid F_1*, F_2*, F_3*, F_4*; then F_1*, F_2*, F_3*, F_4* are sent to the FCFPN, which outputs the final semantic segmentation result, where the FCFPN is a UperNet decoder without the pooling pyramid module (Pooling Pyramid Module, PPM).
Specifically, if the objects in the image scene are divided into N classes, the output of the model is O = [O_0, O_1, ..., O_{N-1}], where each O_i has size 1024×2048 and O_i gives, for every image position [h, w], the probability that the pixel belongs to the i-th class. For each position [h, w], the channel index with the maximum probability is selected as the class label of that pixel, and all pixels belonging to the same class label are marked with the same color, as sketched below.
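A minimal sketch of this labeling and coloring step is given below; the number of classes and the random palette are placeholders (the Cityscapes palette is not reproduced here), and PyTorch is assumed only for illustration.

```python
import torch

def colorize(logits: torch.Tensor, palette: torch.Tensor) -> torch.Tensor:
    """logits: (N, H, W) per-class scores; palette: (N, 3) uint8 colors per class."""
    labels = logits.argmax(dim=0)   # channel index of the maximum probability per pixel
    return palette[labels]          # (H, W, 3) color-coded segmentation map

num_classes = 19                    # e.g. the 19 evaluation classes of Cityscapes
logits = torch.randn(num_classes, 1024, 2048)
palette = torch.randint(0, 256, (num_classes, 3), dtype=torch.uint8)
print(colorize(logits, palette).shape)  # torch.Size([1024, 2048, 3])
```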
As shown in FIG. 1, the enhanced context embedding (RCE, Reinforce Context Embedding) is used to obtain the semantic information (F_c) of the feature maps; the fusion module (Fusion Block) is used to fuse the context information and the spatial information; CLGD [20] is the cross-layer fusion module for the semantic information F_c* and F_1; the FCFPN is the UperNet decoder without the pooling pyramid module PPM.
In this embodiment, the specific process of step S4 is as follows:
S4.1, constructing spatial information. The construction of the spatial information is similar to the Spatial Prior Module in the Vision Transformer Adapter (ViT-Adapter), except that the method of this embodiment takes the spatial features of the encoder itself (i.e. F_2, F_3 and F_4) as the spatial information
F_sp ∈ R^((H_2·W_2 + H_3·W_3 + H_4·W_4) × D),
where H_i, W_i are the height and width of the feature map F_i (i = 2, 3, 4) and D is the embedding dimension (Embedding Dim), which is the same as the dimension of the context information F_c formed by the enhanced context embedding RCE.
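A minimal sketch of this construction is given below: F_2, F_3 and F_4 are flattened and stacked into one token sequence of embedding dimension D. The 1×1 projection to D channels and the example channel and spatial sizes are assumptions for illustration; the text above does not specify how the channel counts are matched.

```python
import torch
import torch.nn as nn

class SpatialTokens(nn.Module):
    def __init__(self, in_channels, embed_dim):
        super().__init__()
        # assumed 1x1 projections bringing each level to the embedding dimension D
        self.proj = nn.ModuleList([nn.Conv2d(c, embed_dim, kernel_size=1) for c in in_channels])

    def forward(self, feats):
        # feats: [F2, F3, F4] with shapes (B, C_i, H_i, W_i)
        tokens = [p(f).flatten(2).transpose(1, 2) for p, f in zip(self.proj, feats)]
        return torch.cat(tokens, dim=1)   # (B, H2*W2 + H3*W3 + H4*W4, D)

# Placeholder channel and spatial sizes for F2, F3, F4.
f2 = torch.randn(1, 384, 128, 256)
f3 = torch.randn(1, 768, 64, 128)
f4 = torch.randn(1, 1536, 32, 64)
f_sp = SpatialTokens([384, 768, 1536], embed_dim=256)([f2, f3, f4])
print(f_sp.shape)  # torch.Size([1, 43008, 256])
```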
S4.2, generating context information.
Since the model size limitation of the UperNet decoder is limited by the number of feature channels of the feature pyramid of the encoder, the method of the embodiment performs feature compression on the features generated by the encoder through the embedding dimension.
Specifically, to enhance the characterizability of the model, the method of this embodiment compresses the original feature pyramid by RCE to form context information, as shown in fig. 2. Because the traditional convolutional neural network structure has the Patch Embedding (Patch Embedding) function, namely, the information extraction of the network under the same scale is realized through convolution and downsampling, the characteristic output of the convolutional neural network can be directly utilized as the multi-scale Patch Embedding. Furthermore, the number of channels is controlled by the embedding dimension.
More specifically, as shown in FIG. 2, the RCE forms the context information by directly using the feature maps F_i, i = 2, 3, 4, output by the encoder. First, the channels are compressed by a convolution Conv with a kernel size of 1×1; then, so that the features can be concatenated along the channel dimension, F_2 is down-sampled and F_4 is up-sampled to the size of F_3, forming F'_2, F'_3, F'_4, which are passed through another 1×1 convolution Conv and a Flatten operation to form the context information
F_c ∈ R^((H_3·W_3) × D),
where D is the embedding dimension.
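A sketch of the RCE as described above follows: per-level 1×1 compression, resizing F_2 and F_4 to the spatial size of F_3, channel concatenation, a second 1×1 convolution to the embedding dimension D, and flattening. The intermediate channel width and the example feature sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RCE(nn.Module):
    def __init__(self, in_channels, embed_dim, mid_channels=64):
        super().__init__()
        # one 1x1 compression convolution per input level (F2, F3, F4)
        self.compress = nn.ModuleList([nn.Conv2d(c, mid_channels, 1) for c in in_channels])
        self.fuse = nn.Conv2d(3 * mid_channels, embed_dim, 1)

    def forward(self, f2, f3, f4):
        target = f3.shape[-2:]                               # resize everything to F3's grid
        c2 = F.interpolate(self.compress[0](f2), size=target, mode="bilinear", align_corners=False)
        c3 = self.compress[1](f3)
        c4 = F.interpolate(self.compress[2](f4), size=target, mode="bilinear", align_corners=False)
        fused = self.fuse(torch.cat([c2, c3, c4], dim=1))    # (B, D, H3, W3)
        return fused.flatten(2).transpose(1, 2)              # (B, H3*W3, D) context tokens F_c

rce = RCE(in_channels=[384, 768, 1536], embed_dim=256)
f2, f3, f4 = torch.randn(1, 384, 128, 256), torch.randn(1, 768, 64, 128), torch.randn(1, 1536, 32, 64)
print(rce(f2, f3, f4).shape)  # torch.Size([1, 8192, 256])
```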
S4.3, the context information F has been obtained by steps S4.1 and S4.2 c And spatial information F sp Next, interaction of the two information is to be achieved. In this embodiment, the design concept of the Fusion Block derives from the Cross attribute concept in the ViT Adapter, unlike the ViT Adapter, which is shown in fig. 3 (a), the ViT Adapter uses ViT as a main body, but multiple ViT blocks using ViT as a main body can cause an increase in computational complexity of the model, so that in order to reduce the complexity of the model, the method in this embodiment, when designing the Fusion Block of the decoder, the correction of semantic information is realized only by one ViT Block, and in order to realize the Attention of a Cross window, the ViT Block is replaced by the Swin Block, so that the problem that the Attention mechanism is difficult to be introduced on a single scale due to the size problem between layers of the feature pyramid is solved.
Specifically, the Fusion Block has a structure as shown in fig. 3 (b), and comprises three parts, namely an Injector, an Extractor and a cross-window attention module Swin Block, wherein the Injector and the Extractor are a spatial feature Injector Spatial Feature Injector and a Multi-scale feature Extractor Multi-Scale Feature Extractor in a ViT Adapter, the Injector fuses the feature attention of spatial information into context information, and the Extractor imparts the feature attention of the context information into the spatial information, so that deep features of a feature pyramid can better act on shallow features; the cross-window attention module Swin Block is used for realizing a cross-window attention mechanism.
The spatial information and the context information are generated to obtain F after Fusion Block 2 * 、F 3 * 、F 4 *
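A structural sketch of the Fusion Block is given below: the Injector lets the context tokens attend to the spatial tokens, a block then processes the context, and the Extractor lets the spatial tokens attend back to the context. For brevity the shifted-window Swin Block is stood in for by plain multi-head attention, and the normalization layout, the absence of FFN/gating and the toy token counts are simplifying assumptions, not the actual design.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, dim, heads):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query, key_value):
        kv = self.norm_kv(key_value)
        out, _ = self.attn(self.norm_q(query), kv, kv)
        return query + out                                # residual connection

class FusionBlock(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.injector = CrossAttention(dim, heads)        # spatial attention -> context
        self.context_block = CrossAttention(dim, heads)   # stand-in for the Swin Block
        self.extractor = CrossAttention(dim, heads)       # context attention -> spatial

    def forward(self, f_sp, f_c):
        f_c = self.injector(f_c, f_sp)                    # inject spatial information into the context
        f_c = self.context_block(f_c, f_c)                # self-attention (shifted windows in the patent)
        f_sp = self.extractor(f_sp, f_c)                  # hand the context back to the spatial tokens
        return f_sp, f_c

# Toy token counts (kept small so the example runs cheaply).
f_sp, f_c = torch.randn(1, 672, 256), torch.randn(1, 128, 256)
new_sp, new_c = FusionBlock()(f_sp, f_c)
print(new_sp.shape, new_c.shape)  # torch.Size([1, 672, 256]) torch.Size([1, 128, 256])
```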
S4.4, F_2*, F_3*, F_4* obtained in step S4.3 and F_1 are passed through the cross-layer fusion module CLGD to obtain F_1*.
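The internal structure of the CLGD is taken from the cited literature and is not detailed in this text. Purely as an assumed placeholder, the sketch below fuses the up-sampled deep feature into F_1 through a learned gate; it illustrates only the role of the step (deep-to-shallow cross-layer fusion), not the actual CLGD design, and assumes both inputs already have the embedding dimension.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedCrossLayerFusion(nn.Module):
    """Assumed stand-in for the CLGD step: gate the up-sampled deep feature into F1."""
    def __init__(self, dim=256):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(2 * dim, dim, 1), nn.Sigmoid())
        self.out = nn.Conv2d(dim, dim, 3, padding=1)

    def forward(self, f1, f2_star):
        deep = F.interpolate(f2_star, size=f1.shape[-2:], mode="bilinear", align_corners=False)
        g = self.gate(torch.cat([f1, deep], dim=1))   # per-pixel gate computed from both levels
        return self.out(f1 + g * deep)                # gated residual fusion -> F1*

f1, f2_star = torch.randn(1, 256, 128, 256), torch.randn(1, 256, 64, 128)
print(GatedCrossLayerFusion()(f1, f2_star).shape)  # torch.Size([1, 256, 128, 256])
```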
S4.5, finally, F 1 * 、F 2 * 、F 3 * 、F 4 * Sending the semantic segmentation result to an UuperNet decoder without a pooling pyramid module, and outputting the final semantic segmentation result.
Example 2
To demonstrate the effectiveness of the method of Example 1, the following ablation experiments were performed on the Cityscapes dataset [12]:
1. Evaluation criteria of the model
In this embodiment, the model is evaluated by three criteria: mean intersection over union (Mean IoU), floating-point operations (FLOPs), and number of parameters (Param).
(1) Mean IoU:
TP: the number of samples that are actually positive and predicted as positive by the model;
TN: the number of samples that are actually negative and predicted as negative;
FP: the number of samples that are actually negative but predicted as positive;
FN: the number of samples that are actually positive but predicted as negative. Therefore
IoU = TP / (TP + FP + FN),
and Mean IoU is the average of the IoU over all categories in the dataset. The larger the value, the higher the accuracy of the model.
(2) FLOPs: the number of floating-point operations, which can be understood as the amount of computation, used to measure the computational complexity of the model. The larger the value, the larger the computation of the model.
(3) Param: the number of learnable parameters of the model, used to measure the size of the model. The larger the value, the more memory the model occupies.
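As an illustration of the Mean IoU criterion, the sketch below accumulates a confusion matrix from a predicted and a ground-truth label map and averages IoU = TP / (TP + FP + FN) over the classes; ignore-label handling and multi-image accumulation are omitted for brevity, and PyTorch is assumed only for illustration.

```python
import torch

def mean_iou(pred: torch.Tensor, target: torch.Tensor, num_classes: int) -> float:
    """pred, target: (H, W) integer label maps with values in [0, num_classes)."""
    idx = target.flatten() * num_classes + pred.flatten()
    conf = torch.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = conf.diag().float()
    fp = conf.sum(dim=0).float() - tp          # predicted as the class but actually another class
    fn = conf.sum(dim=1).float() - tp          # actually the class but predicted as another class
    iou = tp / (tp + fp + fn).clamp(min=1)     # avoid division by zero for absent classes
    return iou.mean().item()

pred = torch.randint(0, 19, (1024, 2048))
target = torch.randint(0, 19, (1024, 2048))
print(mean_iou(pred, target, num_classes=19))
```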
2. Ablation experiments of RCE
The method of Example 1 obtains enriched semantic information through the RCE. To demonstrate the role of the RCE, this example designed fusion strategies over different layers, as shown in Table 1.
TABLE 1
[Table 1 is reproduced as an image in the original publication.]
Table 1 shows the results of the RCE ablation experiment on the Cityscapes validation set, where the RCE column indicates the backbone output layers F_i involved in the RCE.
As shown in Table 1, the RCE significantly improves the segmentation accuracy. Compared with the FCFPN, taking F_2, F_3 and F_4 as the input features of the RCE ultimately improves the accuracy by 1.97% (Res50) and 1.91% (Res101). It can be seen that, without significantly increasing the computation and the number of parameters of the model, F_2, F_3 and F_4 are an ideal choice for the embedding layer. For semantic segmentation, context information is the key to the representation of feature map information, and feature maps at different scales attend to different context information, which improves the segmentation accuracy of the model. From here on, the RCE takes F_2, F_3 and F_4 as the embedding layer by default.
3. Ablation experiments of Fusion Block (FB)
TABLE 2
[Table 2 is reproduced as an image in the original publication.]
Table 2 shows the results of the Fusion Block (FB) and RCE ablation experiments on the Cityscapes dataset. Because of the nature of the FB itself, i.e. the two-dimensional feature map must be serialized by a patch embedding, this embodiment selects the deepest feature F_4 as the input of the patch embedding.
As shown in Table 2, this embodiment still uses the FCFPN as the baseline. First, to determine the influence of FB and RCE on model accuracy, the corresponding experiments were performed separately: with the FB alone, there is an improvement of 5.71% and 4.85% over the FCFPN when the backbone is Res50 and Res101, respectively, while using FB and RCE together brings improvements of 6.09% and 5.24%, respectively, which illustrates the influence of FB and RCE on model accuracy.
4. Hyper-parameter settings of the model
Because the ViT-Adapter sets different embedding dims and numbers of heads to design encoders of different scales, this embodiment likewise provides decoder settings in five different modes to investigate the performance of the pyramid cross-layer fusion decoder (hereinafter PCFD) under different modes. As shown in Table 3 below, the PCFD is divided into five modes, including tiny, small, base and large; in the preceding ablation experiments, the setting adopted by default in this example is the tiny mode.
TABLE 3
[Table 3 is reproduced as an image in the original publication.]
Table 3 shows the hyper-parameter settings of the PCFD; Embedding Dim denotes the feature dimension (number of feature channels) of the context information and the spatial information, and Head (Space, Context, Swin) denotes the number of attention heads in the Injector, the Extractor and the cross-window attention module Swin Block.
TABLE 4
[Table 4 is reproduced as an image in the original publication.]
Table 4 shows the experimental results of the different modes on the Cityscapes dataset, where the marked entries indicate that the model adopts the OHEM [17] strategy during training.
From the experimental results, increasing the PCFD mode does not bring an obvious improvement in model accuracy. Therefore only the tiny mode is selected as the default model hyper-parameter configuration in the subsequent experiments.
Example 3
To demonstrate the advantages of the PCFD, the following comparative experiments were performed in this example, again on the Cityscapes dataset.
TABLE 5
[Table 5 is reproduced as an image in the original publication.]
Table 5 shows a comparison on the Cityscapes dataset, where FLOPs are the floating-point operations of the model at the same input size; the marks denote, respectively, that the model structure is fine-tuned by training on another dataset and that the model adopts OHEM [26] during training; # denotes that the crop_size of the model during training is 896×896; OM denotes Out of Memory.
Table 5 gives the results of the most advanced methods on Cityscapes, divided into two groups: the first group uses a CNN as the backbone and the second group uses a Transformer as the backbone. On this dataset, when the standard ResNet is chosen as the representative of CNNs, the PCFD is superior to the other methods in terms of model parameters, floating-point operations and model accuracy. When Swin-L and ViT-Adapter-L are used as backbones, the PCFD is 0.8% below the best model accuracy (mIoU), but the parameter count and floating-point operations of the PCFD are reduced by 23% and 68%, respectively, compared with ViT-Adapter-UperNet.
Feature visualization: because different encoders have different inherent characteristics, the encoders are divided into three classes, namely CNN, Swin Transformer and ViT-Adapter, and the PCFD plays a slightly different role in each of the three.
As shown in FIG. 4, this embodiment visualizes, for the three types of encoders ResNet, Swin Transformer and ViT-Adapter, the feature maps F_i and the feature maps F_i* obtained through the RCE and the Fusion Block, where i ∈ [1, 4], taking ResNet101, Swin-Large and ViT-Adapter as examples. It can be seen from the figure that the feature map visualizations of ResNet101 and Swin-Large differ greatly; the root cause of this has been discussed in recent work [28].
For both, however, after the feature fusion of F_1, the edges of the segmented objects and the texture inside the object contours can be clearly mapped.
In R101, since spatially local features are the focus of the convolutional network, the PCFD achieves compression of the feature channels, introduction of the attention mechanism, and fusion of deep features with shallow features.
In the Swin Transformer, since the Transformer focuses on the similarity of global features, after the PCFD the semantic distinction of the shallow features is enhanced by the cross-layer fusion of the context information obtained through the RCE with the spatial information. The effect of the PCFD in the Swin Transformer is therefore the compression of the feature channels and the fusion of deep features with shallow features.
In the ViT-Adapter, because of the characteristics of the ViT-Adapter itself (the width and height of the input image must be equal), this embodiment selects a portion of the original image as the model input and scales the feature maps to the same size as the R101-PCFD and Swin-PCFD feature maps. From the visualized feature maps, the feature distinction between different classes increases after the PCFD, the features are not lost significantly after feature compression, and the shallow feature F_2 also fuses deep context information.
Various modifications and variations of the present invention will be apparent to those skilled in the art in light of the foregoing teachings and are intended to be included within the scope of the following claims.

Claims (4)

1. An image semantic segmentation method based on a pyramid cross-layer fusion decoder, characterized by comprising the following specific steps:
S1, inputting an image;
S2, preprocessing the data;
S3, sending the image processed in step S2 into an encoder to generate an original feature pyramid F_1, F_2, F_3, F_4;
S4, sending the original feature pyramid obtained in step S3 into the pyramid cross-layer fusion decoder; in the pyramid cross-layer fusion decoder, the original feature pyramid passes through the enhanced context embedding (RCE) and the fusion module (Fusion Block) to generate the feature pyramid F_1*, F_2*, F_3*, F_4*; then F_1*, F_2*, F_3*, F_4* are sent to the FCFPN, which outputs the final semantic segmentation result, where the FCFPN is a UperNet decoder without the pooling pyramid module; the specific process is as follows:
S4.1, constructing spatial information: the spatial features inherent in the encoder itself, i.e. F_2, F_3 and F_4, are taken as the spatial information
F_sp ∈ R^((H_2·W_2 + H_3·W_3 + H_4·W_4) × D),
where H_i, W_i are the height and width of the feature map F_i, i = 2, 3, 4, and D is the embedding dimension, which is the same as the dimension of the context information formed by the enhanced context embedding RCE;
S4.2, generating context information:
the feature maps F_i, i = 2, 3, 4, output by the encoder are used directly; first, the channels are compressed by a convolution Conv with a kernel size of 1×1; next, F_2 is down-sampled and F_4 is up-sampled to the size of F_3, forming F'_2, F'_3, F'_4, which are passed through another 1×1 convolution Conv and a Flatten operation to form the context information
F_c ∈ R^((H_3·W_3) × D),
where D is the embedding dimension;
S4.3, the fusion module comprises three parts: an injector, an extractor and a cross-window attention module Swin Block; the injector and the extractor are the Spatial Feature Injector and the Multi-Scale Feature Extractor of the ViT-Adapter, the injector fuses the feature attention of the spatial information into the context information, the extractor feeds the feature attention of the context information back into the spatial information, and the cross-window attention module is used to realize a cross-window attention mechanism;
after passing through the fusion module, the spatial information and the context information yield F_2*, F_3*, F_4*;
S4.4, F_2*, F_3*, F_4* obtained in step S4.3 and F_1 are passed through the cross-layer fusion module CLGD to obtain F_1*;
S4.5, finally, F 1 * 、F 2 * 、F 3 * 、F 4 * Sending the semantic segmentation result to an UuperNet decoder without a pooling pyramid module, and outputting the final semantic segmentation result.
2. The method according to claim 1, wherein the specific process of step S2 is:
S2.1, normalization: the values of the three channels of the RGB image F_0 are normalized, i.e. F_0^n = F_0/255 = [F_R, F_G, F_B]/255, where F_R, F_G, F_B each have size 1024×2048;
S2.2, standardization: standardize along the R, G, B channel dimensions, i.e. F_0^ns = (F_0^n - mean)/std, where mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225].
3. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program which, when executed by a processor, implements the method of any of claims 1-2.
4. A computer device comprising a processor and a memory, the memory for storing a computer program; the processor being adapted to implement the method of any of claims 1-2 when the computer program is executed.
CN202310169764.2A 2023-02-27 2023-02-27 Pyramid cross-layer fusion decoder based on semantic segmentation Pending CN116310324A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310169764.2A CN116310324A (en) 2023-02-27 2023-02-27 Pyramid cross-layer fusion decoder based on semantic segmentation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310169764.2A CN116310324A (en) 2023-02-27 2023-02-27 Pyramid cross-layer fusion decoder based on semantic segmentation

Publications (1)

Publication Number Publication Date
CN116310324A true CN116310324A (en) 2023-06-23

Family

ID=86823329

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310169764.2A Pending CN116310324A (en) 2023-02-27 2023-02-27 Pyramid cross-layer fusion decoder based on semantic segmentation

Country Status (1)

Country Link
CN (1) CN116310324A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117152441A (en) * 2023-10-19 2023-12-01 中国科学院空间应用工程与技术中心 Biological image instance segmentation method based on cross-scale decoding
CN117152441B (en) * 2023-10-19 2024-05-07 中国科学院空间应用工程与技术中心 Biological image instance segmentation method based on cross-scale decoding

Similar Documents

Publication Publication Date Title
Hafiz et al. A survey on instance segmentation: state of the art
Sun et al. Deep RGB-D saliency detection with depth-sensitive attention and automatic multi-modal fusion
Zhou et al. Specificity-preserving RGB-D saliency detection
Hu et al. SAC-Net: Spatial attenuation context for salient object detection
Ju et al. A simple and efficient network for small target detection
CN111062395B (en) Real-time video semantic segmentation method
CN112634296A (en) RGB-D image semantic segmentation method and terminal for guiding edge information distillation through door mechanism
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
CN113052755A (en) High-resolution image intelligent matting method based on deep learning
CN110363068A (en) A kind of high-resolution pedestrian image generation method based on multiple dimensioned circulation production confrontation network
Xu et al. RGB-T salient object detection via CNN feature and result saliency map fusion
CN116310324A (en) Pyramid cross-layer fusion decoder based on semantic segmentation
Yu et al. WaterHRNet: A multibranch hierarchical attentive network for water body extraction with remote sensing images
Zheng et al. Feature pyramid of bi-directional stepped concatenation for small object detection
CN113705575B (en) Image segmentation method, device, equipment and storage medium
Gao et al. MLTDNet: an efficient multi-level transformer network for single image deraining
Cheng et al. A survey on image semantic segmentation using deep learning techniques
Zhang et al. Underwater target detection algorithm based on improved YOLOv4 with SemiDSConv and FIoU loss function
Wang et al. Multi-scale dense and attention mechanism for image semantic segmentation based on improved DeepLabv3+
Huang et al. DeeptransMap: a considerably deep transmission estimation network for single image dehazing
Qiao et al. Depth super-resolution from explicit and implicit high-frequency features
Zou et al. Diffcr: A fast conditional diffusion framework for cloud removal from optical satellite images
Zhu et al. HDRD-Net: High-resolution detail-recovering image deraining network
Wang et al. A Novel Neural Network Based on Transformer for Polyp Image Segmentation
Zhang et al. Se-dcgan: a new method of semantic image restoration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination