CN116543155A - Semantic segmentation method and device based on context cascading and multi-scale feature refinement - Google Patents

Semantic segmentation method and device based on context cascading and multi-scale feature refinement

Info

Publication number
CN116543155A
Authority
CN
China
Prior art keywords
feature
scale
module
feature map
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310508273.6A
Other languages
Chinese (zh)
Inventor
程杰仁
花帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hainan University
Original Assignee
Hainan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hainan University filed Critical Hainan University
Priority to CN202310508273.6A priority Critical patent/CN116543155A/en
Publication of CN116543155A publication Critical patent/CN116543155A/en
Pending legal-status Critical Current

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/70 - Labelling scene content, e.g. deriving syntactic or semantic representations
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a semantic segmentation method and device based on context cascading and multi-scale feature refinement. The method is applied to a convolutional neural network that comprises a semantic segmentation network based on context cascading and multi-scale feature refinement, and comprises the following steps: inputting the image into a backbone network for feature encoding; inputting the feature maps of different receptive fields at different levels into a context cascade module for a cascade operation, obtaining a multi-scale context-information feature map with global features; inputting the feature map of the low-dimensional stage into a multi-scale feature refinement module, obtaining multi-scale spatial information of the low-dimensional stage through channel splitting and convolution, obtaining an attention-guided low-dimensional multi-scale spatial feature map, deeply fusing the low-dimensional multi-scale spatial feature map with the multi-scale context-information feature map with global features, and producing the prediction through upsampling. The invention achieves a better balance between segmentation accuracy and inference speed while mining multi-scale context information and refining spatial details on resource-limited platforms.

Description

Semantic segmentation method and device based on context cascading and multi-scale feature refinement
Technical Field
The invention relates to the field of computer vision, in particular to a semantic segmentation method and device based on context cascading and multi-scale feature refinement.
Background
Most currently popular semantic segmentation networks focus on accuracy. Such networks require a large amount of computation, so their inference speed is very slow and they are difficult to deploy in practical application scenarios. On the other hand, much work sacrifices segmentation performance in pursuit of real-time inference speed. Balancing accuracy and real-time performance has therefore become a difficult challenge in the field of semantic segmentation.
Since the appearance of the fully convolutional network (FCN), semantic segmentation has taken a brand-new direction. The biggest change is that the final fully connected layer of the original CNN is replaced by a convolutional layer, enabling pixel-level dense prediction and greatly improving segmentation accuracy. Many semantic segmentation models built on the FCN architecture followed, such as U-Net, SegNet, the DeepLab series, RefineNet and PSPNet, all of which focus on accuracy and achieve very high accuracy on the Cityscapes dataset. These models all use large, complex backbone networks with many computationally expensive operations. Although they can fully extract the features in an image, the large number of complex operations also makes network inference very slow, so application scenarios with real-time requirements cannot be satisfied.
To address these problems, the present invention aims to achieve real-time semantic segmentation that meets real-time requirements under limited resources, reaching high prediction accuracy at a high inference speed by comprehensively considering the parameter count, computational complexity, accuracy and inference speed of the segmentation network.
Disclosure of Invention
In order to solve the existing technical problems, the embodiment of the invention provides a semantic segmentation method and device based on context cascading and multi-scale feature refinement. The technical scheme is as follows:
in a first aspect, there is provided a processing method for image segmentation, where the method is applied to a convolutional neural network, where the convolutional neural network includes a semantic segmentation network based on context cascading and multi-scale feature refinement, and the semantic segmentation network based on context cascading and multi-scale feature refinement further includes: a backbone network, a context cascade module, a multi-scale feature refinement module and an upsampling module; the method comprises the following steps:
after inputting the image into the backbone network, encoding the semantic information in the image;
inputting the feature maps processed by the backbone network into the context cascade module, and performing a cascade operation on the feature maps of different receptive fields at each level to obtain a feature map of multi-scale context information with global features;
inputting the feature map processed by the backbone network into the multi-scale feature refinement module, obtaining multi-scale spatial information at the low-dimensional stage through channel splitting and convolution, and obtaining an attention-guided low-dimensional multi-scale spatial feature map;
performing deep fusion of the low-dimensional multi-scale spatial feature map and the feature map of multi-scale context information with global features in the multi-scale feature refinement module;
and inputting the deeply fused feature map into the upsampling module, and obtaining a feature map of the same size as the original image after upsampling.
Further, the context cascade module includes dense cascade dilated convolution modules with different dilation-rate combinations. Inputting the feature maps processed by the backbone network into the context cascade module and performing a cascade operation on the feature maps of different receptive fields at each level to obtain a feature map of multi-scale context information with global features includes:
after the feature map processed by the backbone network enters the context cascade module and is channel-compressed, sequentially inputting it into the dense cascade dilated convolution modules with different dilation-rate combinations; in each dense cascade dilated convolution module, sequentially performing several depthwise separable convolutions with different dilation rates together with channel decrement, extracting target features of different sizes from the feature map to obtain multi-scale context information of targets of different sizes; after channel concatenation, fusing the result with the originally input feature map, so that the dense cascade dilated convolution modules with different dilation-rate combinations obtain their respective multi-scale feature maps of different receptive fields;
the context cascade module cascades the multi-scale feature maps of different receptive fields at each level to obtain a multi-scale context-information feature map with local features;
performing a global pooling operation in the context cascade module on the channel-compressed feature map processed by the backbone network, and obtaining a feature map with a global feature representation through upsampling;
and the context cascade module performs a cascade operation on the feature map with the global feature representation and the multi-scale context-information feature map with local features in a short-term dense cascade manner, to obtain the feature map of multi-scale context information with global features.
Further, there are three dense cascade dilated convolution modules, and the different dilation rates of the three modules are set from small to large.
Further, inputting the feature map processed by the backbone network into the multi-scale feature refinement module and obtaining a low-dimensional multi-scale spatial feature map through channel splitting and convolution includes:
splitting the feature map of the low-dimensional stage into four branches along the channel dimension;
processing the four branches in parallel with depthwise separable convolutions of different dilation rates, enriching the multi-scale spatial information of the low-dimensional stage.
Furthermore, the multi-scale spatial information of the low-dimensional stage is guided by an attention mechanism, and a constraint is added to obtain an attention-guided low-dimensional multi-scale spatial feature map.
Further, the attention mechanism employs channel attention.
Further, the multi-scale feature refinement module fuses the attention-guided low-dimensional multi-scale spatial feature map and the feature map of multi-scale context information with global features, integrating the high-dimensional features with the low-dimensional features and refining the spatial detail information.
Further, the upsampling module restores the resulting feature map to the original image size.
In a second aspect, there is provided an image segmentation processing apparatus, the apparatus being applicable to a convolutional neural network including a semantic segmentation network based on context cascading and multi-scale feature refinement, the apparatus comprising:
the backbone network module, configured to feature-encode the input image and extract semantic information;
the context cascade module, configured to perform, through a cascade operation, channel fusion of the output feature maps of different receptive fields at each level of the feature maps processed by the backbone network module, to obtain a feature map of multi-scale context information with global features;
the multi-scale feature refinement module, configured to use the high-dimensional features, via an attention mechanism, to guide the captured low-dimensional multi-scale spatial features, integrating the high- and low-dimensional features of the network and refining the spatial detail information;
and the upsampling module, configured to restore the resulting feature map to the original input image size.
Further, the context cascade module further includes:
a dense cascade dilated convolution module, configured to extract target features of different sizes from the image according to multi-view receptive fields.
The technical scheme provided by the embodiments of the invention has the following beneficial effects. The semantic segmentation method and device based on context cascading and multi-scale feature refinement provide a new efficient real-time semantic segmentation network (CCMFRNet) built on a Context Cascade Module (CCM) and a Multi-scale Feature Refinement Module (MFRM). The CCM performs channel fusion over three Dense Cascade Dilated Convolution Modules (DCDM) with different dilation-rate combinations in a short-term dense cascade manner, capturing rich multi-scale context information and thereby improving the segmentation effect. The MFRM adopts SE channel attention so that high-dimensional features guide the captured low-dimensional multi-scale spatial features, enriching the feature space of the low-dimensional stage, promoting deep feature fusion of the low- and high-dimensional features, and effectively and efficiently refining spatial detail information. Both the CCM and the MFRM effectively improve the network's learning ability, so CCMFRNet has better convergence and higher accuracy, achieving a better balance between accuracy and efficiency.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a block diagram of an overall structure of a processing method for image segmentation according to an embodiment of the present invention;
FIG. 2 is a flow chart of the dense cascade dilated convolution module (DCDM) in a processing method for image segmentation according to an embodiment of the invention;
FIG. 3 is a flow chart of the context cascade module (CCM) in a processing method for image segmentation according to an embodiment of the present invention;
fig. 4 is a device structure diagram of a processing method for image segmentation according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the embodiments of the present invention will be described in further detail with reference to the accompanying drawings.
As shown in fig. 1, an embodiment of the present invention provides a processing method for image segmentation. The method is applied to a convolutional neural network that includes a semantic segmentation network based on context cascading and multi-scale feature refinement (CCMFRNet) for real-time semantic segmentation, mainly comprising a backbone network, a Context Cascade Module (CCM), a Multi-scale Feature Refinement Module (MFRM) and an upsampling module. CCMFRNet aims to capture multi-scale context information and refine spatial detail information in an efficient manner, achieving an overall balance of accuracy and real-time performance. The method comprises the following steps:
1. After the image is input into the backbone network, performing feature encoding on the input image and extracting semantic information;
Feature maps of different resolutions typically have different representation capabilities, and both spatial details and semantic features are critical to the accuracy of semantic segmentation. In this method, a lightweight backbone network is used as the encoder. During backbone feature extraction, the low-dimensional stages of the network tend to contain richer spatial details, while the high-dimensional stages contain more semantic information. To feature-encode the original input image and extract semantic information, as shown in fig. 1, the backbone network is first empirically divided into eight stages, referred to as stage0 to stage7. Stage0 to stage3 are defined as the low-dimensional stages (spatial details): standard convolutions with stride 2 are used, and the encoder output stride (OS) for image feature extraction is set to 8, because high-resolution features retain more detail information and spatial dimensions while image features are extracted. As the feature map passes through these levels of the backbone network, its resolution is halved at each level, from 1/2 to 1/4 and finally to 1/8 of the original image resolution; that is, semantic information is extracted through successive downsampling and convolution operations. Stage4 to stage7 are defined as the high-dimensional stages: to guarantee the final segmentation effect, a dilation rate is introduced into the standard convolution, enlarging the receptive field without reducing the image resolution, and the dilated convolutions keep the feature map at high resolution. From stage4 onwards the resolution is no longer reduced and is kept uniformly at 1/8 of the original image resolution while semantic extraction continues; feature encoding of the original input image finishes at stage7, the tail of the backbone network.
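As a concrete illustration, the following is a minimal PyTorch sketch of the stage layout described above. The concrete backbone, channel widths and dilation values are not fixed by this embodiment; the base width of 64 channels and the dilation rates 2 and 4 below are assumptions for illustration only.

```python
import torch.nn as nn

def conv_bn_prelu(c_in, c_out, stride=1, dilation=1):
    """3x3 convolution followed by batch normalization and PReLU."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, stride=stride,
                  padding=dilation, dilation=dilation, bias=False),
        nn.BatchNorm2d(c_out),
        nn.PReLU(c_out),
    )

class Backbone(nn.Module):
    """Eight stages: stage0-stage3 downsample to 1/8 resolution (OS = 8);
    stage4-stage7 keep 1/8 resolution, enlarging the receptive field with
    dilated convolutions instead of further downsampling."""
    def __init__(self, c=64):
        super().__init__()
        # low-dimensional stages: three stride-2 convs give 1/2, 1/4, 1/8
        self.stage0 = conv_bn_prelu(3, c, stride=2)
        self.stage1 = conv_bn_prelu(c, c, stride=2)
        self.stage2 = conv_bn_prelu(c, c, stride=2)
        self.stage3 = conv_bn_prelu(c, c)                 # stays at 1/8
        # high-dimensional stages: dilation rates are assumed values
        self.stage4 = conv_bn_prelu(c, 2 * c, dilation=2)
        self.stage5 = conv_bn_prelu(2 * c, 2 * c, dilation=2)
        self.stage6 = conv_bn_prelu(2 * c, 2 * c, dilation=4)
        self.stage7 = conv_bn_prelu(2 * c, 2 * c, dilation=4)

    def forward(self, x):
        x = self.stage2(self.stage1(self.stage0(x)))
        low = self.stage3(x)    # low-dimensional features (spatial detail)
        high = self.stage7(self.stage6(self.stage5(self.stage4(low))))
        return low, high        # stage3 output feeds the MFRM, stage7 the CCM
```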
2. Inputting the feature maps processed by the backbone network into the context cascade module, and performing a cascade operation on the feature maps of different receptive fields at each level to obtain a feature map of multi-scale context information with global features;
To capture rich multi-scale context information and improve the segmentation effect while keeping the computation and parameter count low, a single-branch cascade-structured Context Cascade Module (CCM) is proposed. As shown in fig. 3, the CCM includes three Dense Cascade Dilated Convolution Modules (DCDM) with different dilation-rate combinations: DCDM-A, DCDM-B and DCDM-C. First, the feature map at 1/8 of the original resolution output by stage7 at the tail of the backbone network is denoted $X$. Four operations follow after the feature map enters the CCM.
Specifically, the first-stage operation: the input feature map $X$ is channel-compressed by a standard 1×1 convolution to reduce the computation; the convolution kernel is multiplied with the input feature map, the result is additively fused with a bias vector to help prevent model overfitting, and batch normalization (BN) and a PReLU activation function are then applied, improving accuracy at a negligible extra computational cost and yielding the compressed feature map. The number of channels after compression is denoted $C_r$ and the compressed feature map is denoted $X_r$. This operation can be expressed by formula 1:

$$X_r = \delta\big(\mathrm{BN}(W_{1\times 1} * X + b)\big) \tag{1}$$

where $W_{1\times 1}$ denotes the 1×1 convolution, $b$ denotes the bias vector, and $\delta(\mathrm{BN}(\cdot))$ denotes batch normalization (BN) followed by the PReLU activation function.
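As an illustration, formula 1 corresponds to an ordinary 1×1 convolution block; a minimal PyTorch equivalent might look as follows (channel counts are placeholders):

```python
import torch.nn as nn

class ChannelCompress(nn.Module):
    """Formula 1: 1x1 convolution with bias, then BN and PReLU."""
    def __init__(self, c_in, c_r):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_r, kernel_size=1, bias=True)
        self.bn = nn.BatchNorm2d(c_r)
        self.act = nn.PReLU(c_r)

    def forward(self, x):
        # X_r = PReLU(BN(W_1x1 * X + b))
        return self.act(self.bn(self.conv(x)))
```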
Next, the compressed feature map $X_r$ is fed through the three DCDMs (DCDM-A, DCDM-B, DCDM-C). Within each DCDM there is a set of dilation rates; the set of per-module dilation combinations is $D' = \{D_a, D_b, D_c\}$ with $D_a < D_b < D_c$, i.e., the dilation rates are set from small to large, with the aim of acquiring multi-scale context information. Note that the number of input channels and the number of output channels of a feature map passing through a DCDM are identical; the set of output feature maps of the stages in the CCM is denoted $M'$.

Second-stage operation: the compressed feature map $X_r$ enters DCDM-A, whose dilation rates are the smallest, so the features of small-size targets are extracted; through depthwise separable convolutions with different dilation rates and grouped convolutions, a feature map extracting the semantic information of small-size targets is obtained and denoted $X_a$. This operation can be expressed by formula 2:

$$X_a = \mathrm{DCDM\_A}(X_r) \tag{2}$$

Third-stage operation: the feature map $X_a$ enters DCDM-B, whose internal dilation rates are larger than those of DCDM-A, aiming to extract the features of medium-size targets; through depthwise separable convolutions with different dilation rates and grouped convolutions, a feature map extracting the semantic information of medium-size targets is obtained and denoted $X_b$. This operation can be expressed by formula 3:

$$X_b = \mathrm{DCDM\_B}(X_a) \tag{3}$$

Fourth-stage operation: the feature map $X_b$ enters DCDM-C, whose internal dilation rates are enlarged again relative to DCDM-B so as to extract the features of large-size targets; through depthwise separable convolutions with different dilation rates and grouped convolutions, a feature map extracting the semantic information of large-size targets is obtained and denoted $X_c$. This operation can be expressed by formula 4:

$$X_c = \mathrm{DCDM\_C}(X_b) \tag{4}$$

In formulas 2, 3 and 4, $\mathrm{DCDM\_A}(x)$, $\mathrm{DCDM\_B}(x)$ and $\mathrm{DCDM\_C}(x)$ denote the DCDM with dilation-rate combination A, B and C, respectively.
That is, as shown in fig. 3, after the feature map enters the CCM, the receptive field grows progressively as it passes through DCDM-A, DCDM-B and DCDM-C. Skip connections in a short-term dense cascade manner splice the feature maps of the different receptive fields at each stage along the channel dimension, and the output feature maps of DCDM-A, DCDM-B and DCDM-C are concatenated and fused to obtain a multi-scale feature map with local features, whose number of channels is $3C_r$.
Further, four operations are also defined for the feature map entering each DCDM. As shown in fig. 2, when the channel-compressed feature map enters DCDM-A, the dilation rates are small, so the receptive field is small. Following the principle that the receptive field is enlarged step by step, DCDM-A sequentially applies four 3×3 depthwise separable convolutions with different dilation rates, with the dilation-rate set $D = \{d_1, d_2, d_3, d_4\}$, $d_1 < d_2 < d_3 < d_4$. Meanwhile, the number of channels is reduced at a decrement rate of 1/2 (the last-stage operation performs no decrement), which greatly reduces the computation and keeps the structure lightweight. The set of output feature maps of the four stages is denoted $Y = \{Y_1, Y_2, Y_3, Y_4\}$.
Namely, the first-stage operation: when the channel-compressed feature map $X_r$ passes the first 3×3 depthwise separable convolution, the number of channels is reduced by 1/2; that is, with dilation rate $d_1$, the number of channels becomes $C_r/2$, and the first-stage output feature map $Y_1$ is obtained. The first-stage operation can be expressed by formula 5:

$$Y_1 = \Phi_{d_1}(X_r) \tag{5}$$

Second-stage operation: when the first-stage output feature map $Y_1$ passes the second 3×3 depthwise separable convolution, the dilation rate is increased relative to the previous one and the receptive field grows; that is, with dilation rate $d_2$, the number of channels becomes $C_r/4$, and the second-stage output feature map $Y_2$ is obtained. The second-stage operation can be expressed by formula 6:

$$Y_2 = \Phi_{d_2}(Y_1) \tag{6}$$

Third-stage operation: when the second-stage output feature map $Y_2$ passes the third 3×3 depthwise separable convolution, the dilation rate is enlarged again and the receptive field keeps growing; that is, with dilation rate $d_3$, the number of channels becomes $C_r/8$, and the third-stage output feature map $Y_3$ is obtained. The third-stage operation can be expressed by formula 7:

$$Y_3 = \Phi_{d_3}(Y_2) \tag{7}$$

Fourth-stage operation: when the third-stage output feature map $Y_3$ passes the fourth 3×3 depthwise separable convolution, the dilation rate is enlarged once more and the receptive field is further expanded, i.e., the dilation rate is $d_4$. However, to avoid losing too much feature information, the fourth stage performs no channel decrement; the number of channels stays at $C_r/8$, and the fourth-stage output feature map is denoted $Y_4$. The fourth-stage operation can be expressed by formula 8:

$$Y_4 = \Phi_{d_4}(Y_3) \tag{8}$$

In these formulas, $\Phi_{d_z}$ denotes a 3×3 depthwise separable convolution with dilation rate $d_z$.
In actual application scenes there are objects of different sizes. If a network model learns with a receptive field of a single view, it is difficult to extract effective features for objects of different sizes, so the network model captures multi-scale context information through multi-view receptive fields, improving its prediction accuracy. First, the four-stage operations in the DCDM are depthwise separable convolutions with different dilation rates, with the dilation-rate set $D = \{d_1, d_2, d_3, d_4\}$, $d_1 < d_2 < d_3 < d_4$; this configuration gives the four stages receptive fields of different views. Then, channel fusion is applied to the four stage outputs through a short-term dense cascade, so that the DCDM can capture multi-scale context information and enhance the information representation. That is, after the DCDM four-stage operations, the output feature maps of the stages are channel-concatenated to obtain the cascaded feature map $Y_{cat}$. This operation can be expressed by formula 9:

$$Y_{cat} = C(Y_1, Y_2, Y_3, Y_4) \tag{9}$$

where $C$ denotes channel concatenation.
Since the DCDM is a single-branch cascade structure, and the CCM containing three DCDMs is also a single-branch cascade structure, the network becomes deeper, which may cause network degradation. To prevent degradation, a skip connection again adds, element-wise, the cascaded four-stage feature map $Y_{cat}$ and the original input feature map $X_r$ of DCDM-A, producing and outputting the multi-scale feature map $X_a$ with $C_r$ channels; the number of channels of the fused multi-scale feature map is identical to that of the originally input feature map. This lets the DCDM capture multi-scale context information and enhances the information representation. DCDM_B and DCDM_C have the same structure as DCDM_A, differing only in their internal dilation-rate combinations, so the same derivation applies and is not repeated here. The above operations can be expressed by formula 10:

$$X_a = Y_{cat} \oplus X_r \tag{10}$$

where $\oplus$ denotes element-wise addition.
That is, as shown in fig. 2, the channel-compressed feature map $X_r$ is input into DCDM_A and, after the depthwise separable convolutions and grouped convolutions, yields the output feature map $X_a$; DCDM_B and DCDM_C follow analogously.
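Putting formulas 5 to 10 together, a minimal PyTorch sketch of one DCDM could read as below. The dilation values are passed in by the caller, and $C_r$ is assumed divisible by 8 so that the four concatenated stage outputs restore exactly $C_r$ channels.

```python
import torch
import torch.nn as nn

class DSConv(nn.Module):
    """3x3 depthwise separable convolution with a given dilation rate:
    a channel-by-channel (grouped) convolution followed by a pointwise
    1x1 convolution, then BN and PReLU."""
    def __init__(self, c_in, c_out, dilation):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, 3, padding=dilation,
                                   dilation=dilation, groups=c_in, bias=False)
        self.pointwise = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.PReLU(c_out)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

class DCDM(nn.Module):
    """Dense cascade dilated convolution module (formulas 5-10): channels
    shrink Cr -> Cr/2 -> Cr/4 -> Cr/8 -> Cr/8, the four stage outputs are
    concatenated back to Cr channels, and a skip connection adds the input."""
    def __init__(self, cr, dilations):           # dilations = (d1, d2, d3, d4)
        super().__init__()
        d1, d2, d3, d4 = dilations
        self.s1 = DSConv(cr, cr // 2, d1)
        self.s2 = DSConv(cr // 2, cr // 4, d2)
        self.s3 = DSConv(cr // 4, cr // 8, d3)
        self.s4 = DSConv(cr // 8, cr // 8, d4)   # no decrement in stage 4

    def forward(self, x):
        y1 = self.s1(x)
        y2 = self.s2(y1)
        y3 = self.s3(y2)
        y4 = self.s4(y3)
        y = torch.cat([y1, y2, y3, y4], dim=1)   # formula 9: channel splice
        return y + x                             # formula 10: skip connection
```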
Through the above flow, as shown in fig. 3, the CCM adopts a single-branch cascade structure to cascade the outputs of different receptive fields at different levels, simultaneously capturing multi-scale context information and enhancing the information representation. However, this context information consists of local features and cannot acquire information with a global feature representation, so the segmentation effect on large-size objects in the feature map would be poor. To solve this problem, the method applies global average pooling to the channel-compressed feature map $X_r$; the pooled feature map is denoted $X_g$. This operation is expressed by formula 11:

$$X_g = \mathrm{GAP}(X_r) \tag{11}$$

The pooled feature map $X_g$ is restored to the original feature-map size by upsampling, giving a feature map with a global feature representation, denoted $X_{gs}$. This operation can be expressed by formula 12:

$$X_{gs} = S(X_g) \tag{12}$$

where GAP in formulas 11 and 12 denotes global average pooling and $S$ denotes upsampling.
To obtain richer multi-scale context information, the CCM adopts a short-term dense cascade structure and cascades $X_{gs}$ with the output feature maps of the following three-stage operations; the resulting feature map is denoted $X_{ccm}$. This feature map integrates both rich multi-scale context information and information with a global feature representation. The operation can be expressed by formula 13:

$$X_{ccm} = C(X_{gs}, X_a, X_b, X_c) \tag{13}$$
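As a hedged sketch of the whole CCM, reusing the DCDM sketch above (the per-module dilation-rate combinations are not disclosed in this text; the values below are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CCM(nn.Module):
    """Context cascade module: 1x1 channel compression (formula 1), three
    DCDMs with dilation combinations growing from small to large (formulas
    2-4), a global-average-pooling branch (formulas 11-12), and a short-term
    dense cascade of all four outputs (formula 13)."""
    def __init__(self, c_in, cr=128):
        super().__init__()
        self.compress = nn.Sequential(           # formula 1
            nn.Conv2d(c_in, cr, 1), nn.BatchNorm2d(cr), nn.PReLU(cr))
        self.dcdm_a = DCDM(cr, (1, 2, 3, 4))     # small targets (assumed rates)
        self.dcdm_b = DCDM(cr, (2, 4, 6, 8))     # medium targets
        self.dcdm_c = DCDM(cr, (4, 8, 12, 16))   # large targets

    def forward(self, x):
        xr = self.compress(x)
        xa = self.dcdm_a(xr)                     # formula 2
        xb = self.dcdm_b(xa)                     # formula 3
        xc = self.dcdm_c(xb)                     # formula 4
        g = F.adaptive_avg_pool2d(xr, 1)         # formula 11: GAP
        g = F.interpolate(g, size=xr.shape[2:],  # formula 12: upsample back
                          mode='bilinear', align_corners=False)
        return torch.cat([g, xa, xb, xc], dim=1) # formula 13: 4*Cr channels
```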
the DCDM is a core component of the CCM, and the two modules adopt a single-branch cascade structure, a depth separable convolution and a grouping convolution to decompose the standard convolution, and expand the convolution by introducing an expansion rate into the standard convolution, so that the receptive field is expanded under the condition of not reducing the resolution of the image, and the high resolution of the feature map is maintained. The depth separable convolution and the packet convolution are typically used together. The depth separable convolution decomposes the 1 x 1 standard convolution into two steps, the first step is a channel-by-channel convolution, grouping the number of channels, and each group has only a corresponding group of channels to perform the convolution operation, which is the group convolution. Compared with standard convolution, cross-channel convolution can be avoided, and the parameter and the calculated amount are greatly reduced; the second step is point-by-point convolution, and information in the channel dimension is integrated through the point-by-point convolution; and then the channel number decrementing operation is carried out, so that the continuity and the relevance of the information are maintained. Through the operation, the CCM obtains richer multi-scale context information with fewer computing resources and higher accuracy, and the segmentation effect is improved.
3. Inputting the feature map processed by the backbone network into the multi-scale feature refinement module, obtaining multi-scale spatial information at the low-dimensional stage through channel splitting and convolution, and obtaining an attention-guided low-dimensional multi-scale spatial feature map;
A symmetric encoder-decoder structure enlarges the feature-map size through multiple upsamplings and involves multiple feature-fusion operations, which increases the computational cost and memory footprint and slows down inference. As shown in the black-framed part of fig. 1, the number of output channels of the CCM is first adjusted: the CCM output feature map $X_{ccm}$ is channel-compressed by a 1×1 convolution to reduce computation; the adjusted number of channels is denoted $C_l$ and the channel-adjusted CCM output is denoted $X_{ccm}'$ (the operation is similar to formula 1 and is not repeated here). The feature map output by stage3 of the backbone network is denoted $F_l$; its $C_l$ channels are split into four branches, sending the feature map into a multi-branch parallel structure. The branches are depthwise separable convolutions with different dilation rates, with the dilation-rate set $R = \{r_1, r_2, r_3, r_4\}$. The low-dimensional multi-scale spatial feature set captured by this operation is denoted $F_i^s$, where $i$ is 1, 2, 3, 4. The operation can be expressed by formula 14:

$$F_i^s = \Psi_{r_i}(F_l^i), \quad i \in \{1, 2, 3, 4\} \tag{14}$$

where $\Psi_{r_i}$ denotes a depthwise separable convolution with dilation rate $r_i$ and $F_l^i$ denotes the $i$-th channel-split branch of $F_l$.
At this point, the MFRM has captured a low-dimensional multi-scale spatial feature map, as in (a) of fig. 1, enriching the feature space of the low-dimensional stage.
The low-dimensional stage of the network is filled with a large amount of noise information; if the low-dimensional and high-dimensional features were fused directly, the noise would interfere with the final prediction. The MFRM therefore focuses on the relationships among channels: through SE channel attention, the attention-vector set of the CCM output feature map is obtained and denoted $V_l$, containing $l$ attention vectors. This attention constrains the low-dimensional multi-scale spatial feature map so that it can selectively suppress noise information, yielding high-quality multi-scale spatial detail information. Specifically, a Softmax function weights the per-channel attention vectors obtained above, and $V_l$ is evenly divided into four groups, i.e., $V_l = \{V_1, V_2, V_3, V_4\}$. $V_l$ and the low-dimensional multi-scale spatial feature set $F_i^s$ are multiplied element-wise group by corresponding group, and the results are finally concatenated along the channel dimension to obtain the attention-guided low-dimensional multi-scale spatial feature map, denoted $F_{se}$, as in (b) of fig. 1. The operation can be expressed by formula 15:

$$F_{se} = C(V_1 \odot F_1^s,\; V_2 \odot F_2^s,\; V_3 \odot F_3^s,\; V_4 \odot F_4^s) \tag{15}$$

where $\odot$ denotes element-wise multiplication.
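A minimal sketch of the MFRM branch described by formulas 14 and 15, reusing the DSConv sketch above (the dilation-rate set R and the SE reduction ratio are assumptions); the final element-wise addition anticipates the fusion step of the next section (formula 16):

```python
import torch
import torch.nn as nn

class MFRM(nn.Module):
    """Multi-scale feature refinement module: split the low-dimensional
    feature map into four channel groups, run depthwise separable convs
    with different dilation rates (formula 14), weight each group with
    Softmax-normalized SE attention vectors computed from the high-
    dimensional features, and concatenate (formula 15)."""
    def __init__(self, cl, rates=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            DSConv(cl // 4, cl // 4, r) for r in rates)
        self.se = nn.Sequential(                 # SE channel attention
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(cl, cl // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(cl // 4, cl, 1))

    def forward(self, low, high):                # both have C_l channels
        v = torch.softmax(self.se(high), dim=1)  # attention vectors V_l
        groups = torch.chunk(low, 4, dim=1)      # channel split, 4 branches
        weights = torch.chunk(v, 4, dim=1)       # V_l split into 4 groups
        feats = [b(g) * w for b, g, w in zip(self.branches, groups, weights)]
        f_se = torch.cat(feats, dim=1)           # formula 15: F_se
        return f_se + high                       # formula 16: deep fusion
```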
4. Performing deep fusion of the attention-guided low-dimensional multi-scale spatial feature map and the feature map of multi-scale context information with global features in the multi-scale feature refinement module;
To refine the spatial detail information more effectively and efficiently, following the above operations, the attention-guided low-dimensional multi-scale spatial feature map $F_{se}$ and the channel-adjusted CCM output $X_{ccm}'$ are added element-wise to obtain the deeply fused feature map, realizing a more effective and efficient refinement of spatial detail information and improving the segmentation effect. The operation can be expressed by formula 16:

$$F_{out} = F_{se} \oplus X_{ccm}' \tag{16}$$
the MFRM uses the depth separable convolution and a simple long jump connection, so that the memory occupation and the calculation cost are reduced, and the space information can be effectively thinned only by occupying less calculation resources; the attention mechanism is added to improve the segmentation performance and the prediction accuracy of small-size objects; meanwhile, the MFRM achieves considerable accuracy through a lighter structure, high-dimensional features are realized to guide the captured low-dimensional multi-scale space features, feature spaces of low-dimensional stages are enriched, deep feature fusion of the low-dimensional features and the high-dimensional features is promoted, effective and efficient refinement of space detail information is achieved, and the defect of insufficient space detail information of the high-dimensional stages is overcome.
5. Inputting the deeply fused feature map into the upsampling module, and obtaining a feature map of the same size as the original image after upsampling.
Finally, the output deeply fused feature map is directly upsampled through an upsampling layer (UL) to the size of the original input image, giving a feature map of the same size as the original input image.
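For completeness, a hedged end-to-end assembly of the sketches above (Backbone, CCM and MFRM are the illustrative modules defined earlier; channel widths and the class count are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CCMFRNet(nn.Module):
    """Backbone -> CCM -> channel adjustment -> MFRM -> classifier -> UL."""
    def __init__(self, num_classes=19, c=64, cr=128):
        super().__init__()
        self.backbone = Backbone(c)              # stage3 -> c, stage7 -> 2c
        self.ccm = CCM(c_in=2 * c, cr=cr)        # outputs 4*cr channels
        self.adjust = nn.Conv2d(4 * cr, c, 1)    # channel adjustment to C_l
        self.mfrm = MFRM(c)
        self.classifier = nn.Conv2d(c, num_classes, 1)

    def forward(self, x):
        low, high = self.backbone(x)             # both at 1/8 resolution
        ctx = self.adjust(self.ccm(high))        # multi-scale context, C_l
        fused = self.mfrm(low, ctx)              # refined and fused features
        out = self.classifier(fused)
        return F.interpolate(out, size=x.shape[2:],            # upsampling layer
                             mode='bilinear', align_corners=False)

# usage sketch: logits = CCMFRNet()(torch.randn(1, 3, 512, 1024))
```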
The invention discloses a semantic segmentation method based on context cascading and multi-scale feature refinement, applied to a convolutional neural network that comprises an efficient real-time semantic segmentation network with context cascading and multi-scale feature refinement (CCMFRNet), which further comprises: a backbone network, a Context Cascade Module (CCM), a Multi-scale Feature Refinement Module (MFRM) and an upsampling module. In the method, the CCM uses three Dense Cascade Dilated Convolution Modules (DCDM) with different dilation-rate combinations to perform channel fusion in a short-term dense cascade manner, capturing rich multi-scale context information and improving the segmentation effect. The MFRM adopts SE channel attention so that high-dimensional features guide the captured low-dimensional multi-scale spatial features, enriching the feature space of the low-dimensional stage, promoting deep feature fusion of the low- and high-dimensional features, and effectively and efficiently refining spatial detail information. Both the CCM and the MFRM effectively improve the network's learning ability, so CCMFRNet has better convergence and higher accuracy, achieving a better balance between accuracy and efficiency.
As shown in fig. 4, based on the same inventive concept and corresponding to the method of the present application, a block diagram of a device for the semantic segmentation method based on context cascading and multi-scale feature refinement is provided. The device may be applied in a convolutional neural network that includes an efficient real-time semantic segmentation network with context cascading and multi-scale feature refinement (CCMFRNet), and the device includes:
the backbone network module 101 is used for performing feature coding on an input image and extracting semantic information;
the context cascade module 201, configured to perform, through a cascade operation, channel fusion of the output feature maps of different receptive fields at each level of the feature maps processed by the backbone network module, to obtain a feature map of multi-scale context information with global features;
the multi-scale feature refinement module 301, configured to use the high-dimensional features, via an attention mechanism, to guide the captured low-dimensional multi-scale spatial features, integrating the high- and low-dimensional features of the network and refining the spatial detail information;
the upsampling module 501 is configured to restore the size of the resulting feature map to the original input image size.
Corresponding to the device of the application, the efficient real-time semantic segmentation network with context cascading and multi-scale feature refinement (CCMFRNet) specifically further comprises:
the dense cascade dilated convolution module 401, configured to extract target features of different sizes from the image according to multi-view receptive fields.
It will be apparent to those skilled in the art that embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, the present embodiments may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product on one or more computer-readable storage media (including, but not limited to, read-only memory, magnetic or optical disks, and the like) having computer-usable program code embodied therein.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims (10)

1. An image segmentation processing method, characterized in that the method is applied to a convolutional neural network, wherein the convolutional neural network comprises a semantic segmentation network based on context cascading and multi-scale feature refinement, and the semantic segmentation network based on context cascading and multi-scale feature refinement further comprises: a backbone network, a context cascade module, a multi-scale feature refinement module and an upsampling module; the method comprises the following steps:
after inputting the image into the backbone network, encoding the semantic information in the image;
inputting the feature maps processed by the backbone network into the context cascade module, and performing a cascade operation on the feature maps of different receptive fields at each level to obtain a feature map of multi-scale context information with global features;
inputting the feature map processed by the backbone network into the multi-scale feature refinement module, obtaining multi-scale spatial information at the low-dimensional stage through channel splitting and convolution, and obtaining an attention-guided low-dimensional multi-scale spatial feature map;
performing deep fusion of the low-dimensional multi-scale spatial feature map and the feature map of multi-scale context information with global features in the multi-scale feature refinement module;
and inputting the deeply fused feature map into the upsampling module, and obtaining a feature map of the same size as the original image after upsampling.
2. The method of claim 1, wherein the context cascade module comprises: a plurality of dense cascade dilated convolution modules with different dilation-rate combinations;
inputting the feature maps processed by the backbone network into the context cascade module and performing a cascade operation on the feature maps of different receptive fields at each level to obtain a feature map of multi-scale context information with global features comprises:
after the feature map processed by the backbone network enters the context cascade module and is channel-compressed, sequentially inputting it into the dense cascade dilated convolution modules with different dilation-rate combinations; in each dense cascade dilated convolution module, sequentially performing several depthwise separable convolutions with different dilation rates together with channel decrement, extracting target features of different sizes from the feature map to obtain multi-scale context information of targets of different sizes; after channel concatenation, fusing the result with the originally input feature map, so that the dense cascade dilated convolution modules with different dilation-rate combinations obtain their respective multi-scale feature maps of different receptive fields;
the context cascade module cascades the multi-scale feature maps of different receptive fields at each level to obtain a multi-scale context-information feature map with local features;
performing a global pooling operation in the context cascade module on the channel-compressed feature map processed by the backbone network, and obtaining a feature map with a global feature representation through upsampling;
and the context cascade module performs a cascade operation on the feature map with the global feature representation and the multi-scale context-information feature map with local features in a short-term dense cascade manner, to obtain the feature map of multi-scale context information with global features.
3. The method of claim 2, wherein there are three dense cascade dilated convolution modules, and the different dilation rates of the three dense cascade dilated convolution modules are set from small to large.
4. The method of claim 1, wherein inputting the feature map processed by the backbone network into the multi-scale feature refinement module and obtaining a low-dimensional multi-scale spatial feature map through channel splitting and convolution comprises:
splitting the feature map of the low-dimensional stage into four branches along the channel dimension;
processing the four branches in parallel with depthwise separable convolutions of different dilation rates, enriching the multi-scale spatial information of the low-dimensional stage.
5. The method of claim 4, wherein the multi-scale spatial information of the low-dimensional stage is guided by an attention mechanism, and a constraint is added to obtain an attention-guided low-dimensional multi-scale spatial feature map.
6. The method of claim 5, wherein the attention mechanism employs channel attention.
7. The method of claim 1 or 5, wherein the multi-scale feature refinement module fuses the attention-guided low-dimensional multi-scale spatial feature map and the feature map of multi-scale context information with global features, integrates the high-dimensional features with the low-dimensional features, and refines the spatial detail information.
8. The method of claim 1, wherein the upsampling module restores the resulting feature map to an original image size.
9. An image segmentation processing apparatus, wherein the apparatus is applicable to a convolutional neural network comprising a semantic segmentation network based on context cascading and multi-scale feature refinement, the apparatus comprising:
the backbone network module, configured to feature-encode the input image and extract semantic information;
the context cascade module, configured to perform, through a cascade operation, channel fusion of the output feature maps of different receptive fields at each level of the feature maps processed by the backbone network module, to obtain a feature map of multi-scale context information with global features;
the multi-scale feature refinement module, configured to use the high-dimensional features, via an attention mechanism, to guide the captured low-dimensional multi-scale spatial features, integrating the high- and low-dimensional features of the network and refining the spatial detail information;
and the upsampling module, configured to restore the size of the resulting feature map to the original input image size.
10. The apparatus of claim 9, wherein the context cascade module further comprises:
a dense cascade dilated convolution module, configured to extract target features of different sizes from the image according to multi-view receptive fields.
CN202310508273.6A 2023-05-08 2023-05-08 Semantic segmentation method and device based on context cascading and multi-scale feature refinement Pending CN116543155A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310508273.6A CN116543155A (en) 2023-05-08 2023-05-08 Semantic segmentation method and device based on context cascading and multi-scale feature refinement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310508273.6A CN116543155A (en) 2023-05-08 2023-05-08 Semantic segmentation method and device based on context cascading and multi-scale feature refinement

Publications (1)

Publication Number Publication Date
CN116543155A true CN116543155A (en) 2023-08-04

Family

ID=87450084

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310508273.6A Pending CN116543155A (en) 2023-05-08 2023-05-08 Semantic segmentation method and device based on context cascading and multi-scale feature refinement

Country Status (1)

Country Link
CN (1) CN116543155A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116740364A (en) * 2023-08-16 2023-09-12 长春大学 Image semantic segmentation method based on reference mechanism
CN116740364B (en) * 2023-08-16 2023-10-27 长春大学 Image semantic segmentation method based on reference mechanism

Similar Documents

Publication Publication Date Title
CN111950723B (en) Neural network model training method, image processing method, device and terminal equipment
CN111582316A (en) RGB-D significance target detection method
CN112088393B (en) Image processing method, device and equipment
CN113076957A (en) RGB-D image saliency target detection method based on cross-modal feature fusion
CN116543155A (en) Semantic segmentation method and device based on context cascading and multi-scale feature refinement
CN114693929A (en) Semantic segmentation method for RGB-D bimodal feature fusion
CN115082928A (en) Method for asymmetric double-branch real-time semantic segmentation of network for complex scene
CN111369430A (en) Mobile terminal portrait intelligent background replacement method based on mobile deep learning engine
CN113205519A (en) Image segmentation method and system based on multi-branch feature fusion
CN116109920A (en) Remote sensing image building extraction method based on transducer
CN112200817A (en) Sky region segmentation and special effect processing method, device and equipment based on image
CN115082306A (en) Image super-resolution method based on blueprint separable residual error network
CN114996495A (en) Single-sample image segmentation method and device based on multiple prototypes and iterative enhancement
CN112819874A (en) Depth information processing method, device, apparatus, storage medium, and program product
CN114494006A (en) Training method and device for image reconstruction model, electronic equipment and storage medium
CN117036436A (en) Monocular depth estimation method and system based on double encoder-decoder
CN116824005A (en) Image processing method and device, storage medium and electronic equipment
CN116433491A (en) Image processing method, device, equipment, storage medium and product
CN112529064B (en) Efficient real-time semantic segmentation method
CN112364804B (en) Pedestrian detection method based on depth separable convolution and standard convolution
CN113255675B (en) Image semantic segmentation network structure and method based on expanded convolution and residual path
CN111881794B (en) Video behavior recognition method and system
CN115345801A (en) Image compression and filter removal method and system based on image denoising idea
CN109815911B (en) Video moving object detection system, method and terminal based on depth fusion network
CN111553921A (en) Real-time semantic segmentation method based on channel information sharing residual error module

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination