NL2025236B1

NL2025236B1 - A semantic segmentation architecture

Info

Publication number: NL2025236B1
Application number: NL2025236A
Authority: NL
Inventors: Marzban Shabbir; Zonooz Bahram; Arani Elahe; Pata Andrei
Original assignee: Navinfo Europe B V
Priority date: 2019-11-29
Filing date: 2020-03-30
Publication date: 2021-08-31

Abstract

A. semantic segmentation architecture comprising an asymmetric encoder—decoder structure, wherein the architecture 5 comprises further an adapter for linking different stages of the encoder and the decoder. The adapter amalgamates information from both the encoder and the decoder for preserving and refining information between multiple levels of the encoder and decoder. In this way the adapter aggregates 10 features from different levels and intermediates between encoder and decoder. A. semantic segmentation architecture comprising an asymmetric encoder—decoder structure, wherein the architecture comprises further an adapter for linking different stages of 15 the encoder and the decoder. The adapter amalgamates information from both the encoder and the decoder for preserving and refining information between multiple levels of the encoder and decoder. In this way the adapter aggregates features from different levels and intermediates between 20 encoder and decoder.

Description

A semantic segmentation architecture The invention relates to a semantic segmentation architecture comprising an asymmetric encoder-decoder structure.

The invention further relates to a method of progressive resizing as used in segmentation to reduce training time.

Essentially the invention relates to a convolutional neural network architecture. Convolutional neural networks (CNNs) have brought about a paradigm shift in the field of computer vision, leading to tremendous advances in many tasks [see lit. 10, 11, 13, 14, 16, 25, 29].

Semantic segmentation, which associates each pixel to an object class it belongs to, is a computationally expensive task in computer vision [see lit. 17]. Fast semantic segmentation is broadly applied to several real-time applications including autonomous driving, medical imaging and robotics [see lit. 18, 19, 24, 26]. Accurate CNN-based semantic segmentation requires larger neural networks with deeper and wider layers. These larger networks are therefore not suitable for edge computing devices as they are cumbersome and require substantial resources.

Down-sampling operations, such as pooling and convolutions with stride greater than one, can help decrease the latency of deeper neural networks, however they result in decreased pixel-level accuracy due to the lower resolutions at deeper levels. Many recent approaches employ either encoder-decoder structure [see lit. 1, 23, 281, a two or multi-branch architecture [see lit. 21, 31, 33] or dilated convolutions [see lit. 3-5, 34] to recover spatial information. While these real-time architectures perform appropriately on simple datasets, their performance is sub-optimal for complex datasets possessing more variability in terms of classes, sizes, and shapes. Thus, there is a significant interest in designing CNN architectures that can perform well on complex datasets and, at the same time, are mobile enough to be of

— 2 _ practical use in real-time applications.

The invention has as an object to answer to this need to provide a practical solution which works in real time situations.

The semantic segmentation architecture of the invention which is based on an asymmetric encoder decoder structure, has the features of one or more of the appended claims.

First and primarily the architecture comprises an adapter for linking different stages of the encoder and the decoder. The adaptor utilizes features at different abstraction levels from both the encoder and decoder to improve the feature refinement at a given level allowing the network to preserve deeper level features with higher spatial resolution. Furthermore, the adaptor enables a better gradient flow from deeper layers to shallower layers by adding shorter paths. While training, gradients of loss with respect to weights are calculated in a backword propagation progression starting from outer most layers to inner layers of convolutional neural networks following a path. This propagation can be termed as gradient flow. The reference to ‘better gradient flow’ means the flow of gradient is more direct with shorter paths.

A feature of the semantic segmentation architecture of the invention is therefore that the adapter amalgamates information from both the encoder and the decoder for preserving and refining information between multiple levels of the encoder and decoder.

More specifically the adapter aggregates {features from three different levels and intermediates between encoder and decoder. On a mathematical level the function of the adapter can be expressed as x= D(T(X® 51) + T(x) + U(x? 512) where superscripts a, e, and d denote adaptor, encoder, and decoder respectively, s represents the spatial level in the network, D(:) and U(:) are downsampling and upsampling functions, and T{:) is a transfer function that reduces the

- 3 = number of output channels from an encoder block and transfers them to the adaptor. The invention is also embodied in a method of progressive resizing as used in segmentation of images to reduce training time, wherein the training starts with initial image sizes followed by a progressive increase of said sizes until a final stage of training is conducted using the original image sizes, applying label relaxation of borders in the images. In the invention first one-hot labels are created from a label map followed by a max-pool operation with stride

1. This effectively dilates each one-hot label channel transforming it into multi-hot labels along the borders which can then be used to find union of labels along the border pixels.

The invention will hereinafter be further elucidated with reference to the drawing. In the drawing: — figure 1 shows some schematic illustrations of prior art semantic segmentation architectures; — figure 2 shows a schematic illustration of the semantic segmentation architecture of the invention; — figure 3 shows the architecture of the adapter module; — figure 4 shows semantic segmentation results on a Mapillary Vistas validation set; and — figure 5 shows training profiles of progressive resizing with and without label relaxation.

Whenever in the figures the same reference numerals are applied, these numerals refer to the same parts. Whenever in the following reference is made to RGPNet, this refers to the architecture according to the invention.

In figure 1 different semantic segmentation architectures are shown, notably: (a) In context-based networks, dilated convolutions with multiple dilation rates are employed in cascade or in parallel to capture a multi- scale context. To capture the contextual information at multiple scales, DeepLabV2 [lit. 3] and DeeplabV3 [lit. 4] exploit multiple parallel atrous convolutions with

- 4 - different dilation rates, while PSPNet [lit. 34] performs multi-scale spatial pooling operations. Although these methods encode rich contextual information, they cannot capture boundary details effectively due to strided convolution or pooling operations [lit. ©].

(b) In encoder-decoder networks, an encoder extracts the features of high-level semantic meaning and a decoder densifies the features learned by the encoder. Several studies entail encoder-decoder structure [lit. 1, 7, 9, 15, 20, 23, 37]. The encoder extracts global contextual information and the decoder recovers the spatial information. Deeplabv3+ [lit. 6] utilizes an encoder to extract rich contextual information in conjunction with a decoder to retrieve the missing object boundary details.

However, implementation of dilated convolution at higher dilation rates is computationally intensive making them as yet unsuitable for real-time applications.

{c) In attention-based networks, the feature at each position is selectively aggregated by a weighted sum of the features at all positions. This can be done across channels or spatial dimensions. Attention mechanisms, which help networks to focus on relevant information and ignore the irrelevant information, have been widely used in different tasks, and gained popularity to boost the performance of semantic segmentation. Wang et al. [lit. 30] formalized self-attention by calculating the correlation matrix between each spatial point in the feature maps in video sequences. To capture contextual information, DaNet [lit. 8] and OCNet [lit. 32] apply a self-attention mechanism. DaNet has dual attention modules on position and channels to integrate local features with their respective global dependencies. OCNet, on the other hand, employs the self-attention mechanism to learn the object context map recording the similarities between all the pixels and the associated pixel. PSANet [lit. 35] learns to aggregate contextual information for each individual position via a predicted attention map. Attention based models, however, generally require expensive computation.

— 5 = (d) Multi-branch networks are employed to combine semantic segmentation results at multiple resolution levels. The lower resolution branches yield deeper features with reduced resolution and the higher resolution branches learn spatial details. Another approach to preserve the spatial information is to employ two- or multi-branch approach. The deeper branches extract the contextual information by enlarging receptive fields and shallower branches retain the spatial details. The parallel structure of these networks make them suitable for run time efficient implementations [Lit. 22, 31, 33]. However, they are mostly applicable to the relatively simpler datasets with fewer number of classes. On the other end, HRNet [lit. 27] proposed a model with fully connected links between output maps of different resolutions. This allows the network to generalize better due to multiple paths, acting as ensembles. However, without reduction of spatial dimensions of features, the computational overhead is very high and makes the model no longer feasible for real-time usage.

Figure 2 depicts the architecture of the invention. Rectangular boxes depict a tensor at a given level with the number of channels mentioned as their labels. The arrows represent the convolution operations indicated by the legend.

Figure 3 depicts the operation of a single adapter module at a level in the midst of the decoder and encoder. The adaptor fuses information from multiple abstraction levels; T(:), D(:), and U(:) denote the transfer, downsampling and upsampling functions, respectively. F(:) is the decoder block with shared weights between layers.

As is shown in figure 2 and figure 3, the architecture of the invention is based on a light-weight asymmetric encoder-decoder structure for fast and efficient inference. It comprises three components: an encoder which extracts high-level semantic features, a light asymmetric decoder, and an adaptor which links different stages of encoder and decoder. The encoder decreases the resolution and increases the number of feature maps in the deeper layers, thus it extracts more abstract features in deeper layers with

— 6 _ enlarged receptive fields.

The decoder reconstructs the lost spatial information. The adaptor amalgamates the information from both encoder and decoder allowing the network to preserve and refine the information between multiple levels.

With particular reference to Figure 2, in a given row of the diagram, all the tensors have the same spatial resolution with the number of channels mentioned in the scheme.

Four level outputs from the encoder are extracted at different spatial resolutions 1=4, 1=8, 1=16 and 1=32 with 256, 512, 1024 and 2048 channels, respectively. The number of channels are reduced by a factor of four using 1 X 1 convolutions followed by batch norm and ReLU activation function at each level. These outputs are then passed through a decoder structure with the adaptor in the middle. Finally, a segmentation output is extracted from the largest resolution via 1 X 1 convolution to match the number of channels to segmentation categories.

The adaptor acts as a feature refinement module. The presence of an adaptor precludes the need of a symmetrical encoder-decoder structure. It aggregates the features from three different levels, and intermediates between encoder and decoder as is shown in figure 3. The adaptor function is defined as: x&= D(T(X® 5-1) + T(x%) + U(x? 541) (1) where superscripts a, e, and d denote adaptor, encoder, and decoder respectively, s represents the spatial level in the network. D(:) and U(:) are downsampling and upsampling functions. Downsampling is carried out by convolution with stride 2 and upsampling is carried out by deconvolution with stride 2 matching spatial resolution as well as the number of channels in the current level. T(:) is a transfer function that reduces the number of output channels from an encoder block and transfers them to the adaptor: T(x) = o(w? Q xs+ b?) (2) where © and b are the weight matrix and bias vector,

— 7 — denotes the convolution operation, and &X denotes the activation function.

The decoder contains a modified basic residual block, F, where we use shared weights within the block.

The decoder function is as follows:

x= FX; wd ) (3) The adaptor has a number of advantages.

First, the adaptor aggregates features from different contextual and spatial levels.

Second, it facilitates the flow of gradients from deeper layers to shallower layers by introducing a shorter path.

Third, the adaptor allows for utilizing asymmetric design with a light-weight decoder.

This results in fewer convolution layers, further boosting the flow of gradients.

The adaptor, therefore, makes the network suitable for real- time applications as it provides rich semantic information while preserving the spatial information.

Progressive Resizing with Label Relaxations As mentioned above the invention is also embodied in a method of progressive resizing as used in segmentation to reduce training time.

Conventionally the training starts with smaller image sizes followed by a progressive increase of size until the final stage of the training is conducted using the original image size.

For instance, this technique can theoretically speed up the training time by 16 times per epoch if the image dimensions are decreased by 1/4 and correspondingly the batch size is increased by a factor of 16 in a single iteration.

However, reducing the image size using nearest neighbour interpolation (bi-linear or bi-cubic interpolation are not applicable), introduces noise around the borders of the objects due to aliasing.

Note that inaccurate labelling is another source of noise.

To reduce effects of boundary artifacts in progressive resizing, the invention applies an optimized variant of label relaxation method [lit 36]. In label relaxation along the borders, instead of maximizing likelihood of a target label, likelihood of union

— 8 — of neighbouring pixel labels is maximized. In the invention, first one-hot labels are created from the label map followed by max-pool operation with stride 1. This effectively dilates each one-hot label channel transforming it into multi-hot labels along the borders which can then be used to find union of labels along the border pixels.

The kernel size of the max pooling controls the width containing pixels being treated as border pixels along the borders. Loss at a given border pixel can be calculated as follows where N is set of border labels: Lboundary = -log ZceN (P(C)) (4) Figure 4 depicts some qualitative results obtained by the model of the invention compared to TASCNet and BiSeNet on a Mapillary Vistas validation set. The Mapillary Vistas validation set consists of 20,000 high-resolution street-level images taken from many different locations around the globe and under varying conditions annotated for 65 categories. The dataset is split up in a training set of 18,000 images and a validation set of 2,000 images.

The columns correspond to input image, the output of RGPNet, the output of TASCNet [lit. 15], the output of BiSeNet [lit. 31], and the ground-truth annotation. For all methods R101 is used as the backbone. RGPNet mainly improves the results on road and road-related objects’ pixels.

Experimental results RGPNet In this section, overall performance of RGPNet is evaluated and compared with other real-time semantic segmentation methods (BiSeNet [lit. 31], TASCNet [lit. 15], and ShelfNet [lit. 37]) on the Mapillary validation set. Different feature extractor backbones ResNet [lit. 12] (R101, R50 and R18), Wide-Resnet [lit. 38] (WRN38), and HarDNet [lit.

2] (HarDNet39D} are used.

— g — BiSeNet (R101) 15.5 20.4 50.1 TASCNet (R50) 17.6 46.4 32.8 TASCNet (R101) 13.9 48.8 51.8 ShelfNet (R101) 14.8 49.2 57.7 RGPNet (R101) 18.2 50.2 52.2 RGPNetB (WRN38) 5.72 53.1 215 RGPNet (R18) 54.4 41.7 17.8 Table 1. Mapillary Vistas validation set results. The experiments are conducted using 16-bit floating point (FP16) numbers.

Table 1 compares speed (FPS), mIoU and number of parameters on these methods on 16-bit precision computation. RGPNet (R101) achieves 50:2% mIoU which outperforms TASCNet and ShelfNet with a significant margin and lower latency. Although RGPNet (R101) has more parameters than the TASCNet (R101), both speed and mIoU are considerably higher. However, BiSeNet demonstrates poor performance on Mapillary resulting in the lowest mIcU. The method of the invention also achieves impressive results with a lighter encoder (R18 or HarDNet39D) surpassing BiSeNet with a heavy backbone (R101) significantly, 41:7% vs 20:4% mIoU and 54.4 vs

15.5 FPS. Validation progressive resizing with label relaxation In order to validate the gain from label relaxation, the result of progressive resizing training is compared with and without label relaxation. In these experiments for the first 100 epochs, the input images are resized by a factor of 1/4 both in width and height. At the 100th epoch, the image resize factor is set to 1/2 and, at the 130th epoch, full-sized images are used. With label relaxation, it is observed that the model achieves higher mIoU especially at lower resolutions. To further analyze the effect of label relaxation in

— 10 — progressive resizing technique, the difference in entropy is illustrated between two setups (progressive resizing with and without label relaxation). Figure 5 shows that the model trained with label relaxation is more confident in the predictions around the object boundaries.

Figure 5a shows that training profiles of progressive resizing with (red) and without (green) label relaxation experiments conducted on a Cityscapes validation set.

Cityscapes contains diverse street level images {from 50 different city across Germany and France. It contains 30 classes and only 19 classes of them are used for semantic segmentation evaluation. The dataset contains 5000 high quality pixel-level finely annotated images and 20000 coarsely annotated images. The finely annotated 5000 images are divided into 2975=500=1525 image sets for training, validation and testing.

Specially at lower resolutions, label relaxation helps in achieving a higher mIoU.

Figure 5b depicts a heatmap of difference in entropy between label relaxation and without label relaxation based trained model evaluated on a sample image from validation set. On boundaries of objects, models trained with label relaxation are more confident about the label and hence have lower entropy (blue shades).

The method of progressive resizing with label relaxation according to the invention also has beneficial energy implications as shown in table 2 below. raining scheme [Energy [KJ] Time mien (0) eee, vee Table 2. Progressive resizing result on energy efficiency Table 2 shows that the training time reduced from 109 minutes to 32 minutes, close to the speedup expected from theoretical calculation. The energy consumed by GPU decreases by an approximate factor of 4 with little to no drop in the

— 11 — performance.

Although the invention has been discussed in the foregoing with reference to an exemplary embodiment of the architecture of the invention, the invention is not restricted to this particular embodiment which can be varied in many ways without departing from the invention. The discussed exemplary embodiment shall therefore not be used to construe the appended claims strictly in accordance therewith. On the contrary the embodiment is merely intended to explain the wording of the appended claims without intent to limit the claims to this exemplary embodiment. The scope of protection of the invention shall therefore be construed in accordance with the appended claims only, wherein a possible ambiguity in the wording of the claims shall be resolved using this exemplary embodiment. Aspects of the invention are itemized in the following section.

1. A semantic segmentation architecture comprising an asymmetric encoder-decoder structure, characterized in that the asymmetric architecture is enhanced by comprising further an adapter for linking different stages of the encoder and the decoder.

2. The semantic segmentation architecture of claim 1, characterized in that the adapter amalgamates information from both the encoder and the decoder for preserving and refining information between multiple levels of the encoder and decoder, representing different levels of abstraction.

3. The semantic segmentation architecture of claim 1 or 2, characterized in that the adapter aggregates features from three different levels and intermediates between encoder and decoder.

4. The semantic segmentation architecture of claim 3, characterized in that the adapter has the function x= D(T(xE 5.1) + T(xE) + Und 541) where superscripts a, e, and d denote adaptor, encoder, and decoder respectively, s represents the spatial level in the

- 12 — network, D(:) and U(:) are downsampling and upsampling functions, and T{:) is a transfer function that reduces the number of output channels from an encoder block and transfers them to the adaptor.

5. Method of progressive resizing as used in segmentation to reduce training time, wherein the training starts with reduced image sizes followed by a progressive increase of said sizes until a final stage of training is conducted using the original image sizes, applying label relaxation of borders in the images, characterized in that first one-hot labels are created from a label map followed by a max-pool operation with stride 1.

6. Method according to claim 5, characterized in that each one-hot label channel is dilated so as to transform said one-hot label channel into multi-hot labels along the borders which are used to find union of labels along border pixels.

References

[1] Badrinarayanan, V., Kendall, A., and Cipolla, R. (2017).

Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence, 39(12):2481-2495.

[2] Chao, P., Kao, C.-Y., Ruan, Y.-S., Huang, C.-H., and Lin, Y.-L. (2019). Hardnet: A low memory traffic network. In Proceedings of the IEEE International Conference on Computer Vision, pages 3552-3561.

[3] Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A. L. (2014). Semantic image segmentation with deep convolutional nets and fully connected crfs.

arXiv preprint arXiv:1412.7062.

[4] Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A. L. (2017a). Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834-

848.

[5] Chen, L.-C., Papandreou, G., Schroff, F., and Adam, H. (2017). Rethinking atrous convolution for semantic image segmentation.

- 13 =

[6] Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., and Adam, H. (2018). Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV.

[7] Ding, H., Jiang, X., Shuai, B., Qun Liu, A., and Wang, G.

(2018). Context contrasted feature and gated multi-scale aggregation for scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2393-2402,

[8] Fu, J., Liu, J., Tian, H., Li, Y., Bao, Y., Fang, Z., and Lu, H. {2019a) . Dual attention network for scene segmentation.

In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3146-3154.

[2] Fu, J., Liu, J., Wang, Y., Zhou, J., Wang, C., and Lu, H.

(2019b). Stacked deconvolutional network for semantic segmentation. IEEE Transactions on Image Processing.

[10] Girshick, R. (2015). Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440-1448.

[11] He, K., Zhang, X., Ren, S., and Sun, J. (Z2016a). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770-778.

[12] He, K., Zhang, X., Ren, S., and Sun, J. (2016b). Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) .

[13] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097-1105.

[14] Lan, X., Zhu, X., and Gong, S. (2018). Person search by multi-scale matching. In Proceedings of the European Conference on Computer Vision (ECCV), pages 536-552.

[15] Li, J., Raventos, A., Bhargava, A., Tagawa, T., and Gaiden, A. (2018). Learning to fuse things and stuff.

10 A PREPRINT-NOVEMBER 19, 2019

[16] Li, W., zhu, X., and Gong, S. (2017). Person reidentification by deep joint learning of multi-loss

- 14 — classification. arXiv preprint arXiv:1705.04724.

[17] Long, J., Shelhamer, E., and Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431-3440.

[18] Milioto, A., Lottes, P., and Stachniss, C. (2018). Real- time semantic segmentation of crop and weed for precision agriculture robots leveraging background knowledge in cnns. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 2229-2235. IEEE.

[19] Paszke, A., Chaurasia, A., Kim, S., and Culurciello, E. (2016). Enet: A deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147.

[20] Pohlen, T., Hermans, A., Mathias, M., and Leibe, B. (2017). Full-resolution residual networks for semantic segmentation in street scenes. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21] Poudel, R. P., Liwicki, S., and Cipolla, R. (201%a) .

Fast-scnn: fast semantic segmentation network. arXiv preprint arXiv:1902.04502.

[22] Poudel, R. P., Liwicki, S., and Cipolla, R. (2019b) . Fast-scnn: fast semantic segmentation network. arXiv preprint arXiv:1902.04502.

[23] Ronneberger, O., Fischer, P., and Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234-241, Springer.

[24] Salehi, S. S. M., Hashemi, S. R., Velasco-Annis, C., Ouaalam, A., Estroff, J. A., Erdogmus, D., Warfield, S. K., and Gholipour, A. (2018). Real-time automatic fetal brain extraction in fetal mri by deep learning. In 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pages 720-724. IEEE.

[25] Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

[26] Su, Y.-H., Huang, K., and Hannaford, B. (2018). Real-time

— 15 — vision-based surgical tool segmentation with robot kinematics prior. In 2018 International Symposium on Medical Robotics (ISMR), pages 1-6. IEEE.

[27] Sun, K., Zhao, Y., Jiang, B., Cheng, T., Xiao, B., Liu, D., Mu, Y., Wang, X., Liu, W., and Wang, J. (2019). High- resolution representations for labeling pixels and regions. CoRR, abs/1904.04514.

[28] Sun, S., Pang, J., Shi, J., Yi, S., and Ouyang, W. (2018). Fishnet: A versatile backbone for image, region, and pixel level prediction. In Advances in Neural Information Processing Systems, pages 754-764.

[29] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, 5., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A.

(2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1-9.

[30] Wang, X., Girshick, R., Gupta, A., and He, K. (2018). Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794-7803.

[31] Yu, C., Wang, J., Peng, C., Gao, C., Yu, G., and Sang, N. (2018). Bisenet: Bilateral segmentation network for real- time semantic segmentation. Lecture Notes in Computer Science, page 334-349.

[32] Yuan, Y. and Wang, J. (2018). Oenet: Object context network for scene parsing. arXiv preprint arXiv:1809,00916.

A PREPRINT-NOVEMBER 19, 2019

[33] Zhao, H., Qi, X., Shen, X., Shi, J., and Jia, J. (2018a). Icnet for real-time semantic segmentation on high- resolution images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 405-420.

[34] Zhao, H., Shi, J., Qi, X., Wang, X., and Jia, J. (2017).

Pyramid scene parsing network. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35] Zhao, H., Zhang, Y., Liu, S., Shi, J., Change Loy, C., Lin, D., and Jia, J. (2018b}. Psanet: Point-wise spatial attention network for scene parsing. In Proceedings of

- 16 - the European Conference on Computer Vision (ECCV), pages 267-283.

[36] Zhu, Y., Sapra, K., Reda, F. A., Shih, K. J., Newsam, S., Tao, A., and Catanzaro, B. (2018). Improving semantic segmentation via video propagation and label relaxation.

[37] Zhuang, J. and Yang, J. (2018). Shelfnet for real-time semantic segmentation. arXiv preprint arXiv:1811.11254,

[38] Wu, Z., Shen, C., and van den Hengel, A. (2019). Wider or deeper: Revisiting the resnet model for visual recognition. Pattern Recognition, 90:119-133.

Claims

— 17 — CONCLUSIES— 17 — CONCLUSIONS

1. Semantische segmentatie-architectuur omvattende een asymmetrische-encoder-decoder-structuur, met het kenmerk, dat de asymmetrische architectuur wordt versterkt door verder een adapter te bevatten voor het verbinden van verschillende stadia van de encoder en de decoder.A semantic segmentation architecture comprising an asymmetric encoder-decoder structure, characterized in that the asymmetric architecture is enhanced by further comprising an adapter for connecting different stages of the encoder and the decoder.

2. Semantische segmentatie-architectuur volgens conclusie 1, met het kenmerk, dat de adapter informatie van zowel de encoder als de decoder samenvoegt voor het bewaren en verfijnen van informatie tussen meerdere niveaus van de encoder en decoder, die verschillende abstractieniveaus vertegenwoordigen.A semantic segmentation architecture according to claim 1, characterized in that the adapter merges information from both the encoder and the decoder to store and refine information between multiple levels of the encoder and decoder, representing different levels of abstraction.

3. Semantische segmenteringsarchitectuur volgens conclusie 1 of 2, met het kenmerk, dat de adapter aggregaten bevat van drie verschillende niveaus en tussenproducten tussen encoder en decoder.Semantic segmentation architecture according to claim 1 or 2, characterized in that the adapter contains aggregates of three different levels and intermediates between encoder and decoder.

4. Semantische segmentatie-architectuur volgens conclusie 3, met het kenmerk, dat de adapter de functie heeft x2s= D(T(X® 5.1) + T(x%) + U(x? 541) waar superscripts a, e en d respectievelijk adapter, encoder en decoder aangeven, s het ruimtelijke niveau in het netwerk, D{:) en U(:)downsampling- en upsampling-functies zijn, en T(:) een overdrachtsfunctie is die het aantal uitvoerkanalen van een encoderblok vermindert en naar de adapter overbrengt.Semantic segmentation architecture according to claim 3, characterized in that the adapter has the function x2s= D(T(X® 5.1) + T(x%) + U(x? 541) where superscripts a, e and d indicate adapter, encoder and decoder respectively, s is the spatial level in the network, D{:) and U(:) are downsampling and upsampling functions, and T(:) is a transfer function that reduces the number of output channels of an encoder block, and transfers to the adapter.

5. Werkwijze voor het progressief vergroten/verkleinen zoals gebruikt in segmentatie om de trainingstijd te verkorten, waarbij de training begint met verkleinde afbeeldingsformaten gevolgd door een geleidelijke toename van genoemde formaten tot een laatste trainingsfase wordt uitgevoerd met gebruikmaking van de originele afbeeldingsformaten, waarbij de randen van de afbeeldingen worden gelabeld, met het kenmerk, dat eerst één-hot labels worden gemaakt op basis van een labelkaart gevolgd door een max-pool bewerking met stride 1.5. Process of progressive resizing as used in segmentation to reduce training time, in which training starts with reduced image sizes followed by a gradual increase of said sizes until a final training phase is performed using the original image sizes, where the borders of the images are tagged, characterized in that first one-hot labels are created from a label map followed by a max-pool operation with stride 1.

6. Werkwijze volgens conclusie 5, met het kenmerk, dat elk één-hot label-kanaal is verwijd om het één-hot-label-A method according to claim 5, characterized in that each one-hot-label channel is expanded to accommodate the one-hot-label

- 18 — kanaal te transformeren in multi-hot-labels langs de randen die worden gebruikt om vereniging van labels langs grenspixels te vinden.- 18 — transform channel into multi-hot labels along the edges used to find association of labels along boundary pixels.