CN108876792B - Semantic segmentation method, device and system and storage medium - Google Patents

Semantic segmentation method, device and system and storage medium

Info

Publication number
CN108876792B
Authority
CN
China
Prior art keywords
convolution
module
convolution module
semantic segmentation
modules
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810333056.7A
Other languages
Chinese (zh)
Other versions
CN108876792A (en)
Inventor
章圳黎
张祥雨
彭超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Megvii Technology Co Ltd
Original Assignee
Beijing Megvii Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Megvii Technology Co Ltd filed Critical Beijing Megvii Technology Co Ltd
Priority to CN201810333056.7A
Publication of CN108876792A
Application granted
Publication of CN108876792B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Probability & Statistics with Applications (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention provides a semantic segmentation method, apparatus, system and storage medium. The method comprises the following steps: acquiring an image to be processed; and inputting the image to be processed into a U-type network to obtain a semantic segmentation result of the image to be processed output by the U-type network, wherein the contraction path of the U-type network comprises n convolution modules connected in sequence, the output features of the i-th convolution module among the n convolution modules are combined with the output features of at least one convolution module after the i-th convolution module, and the combined features are skip-connected to the output end of the deconvolution layer corresponding to the i-th convolution module in the expansion path of the U-type network, where n is an integer greater than 1 and 1 ≤ i < n. According to the semantic segmentation method, apparatus, system and storage medium, because an improved U-type network in which shallow features and deep features can be well fused is adopted, a more accurate semantic segmentation result can be obtained.

Description

Semantic segmentation method, device and system and storage medium
Technical Field
The present invention relates to the field of computers, and more particularly, to a semantic segmentation method, apparatus and system, and a storage medium.
Background
Semantic segmentation is a fundamental task in computer vision. Because a semantic segmentation task needs to classify every pixel of an image, the task is currently solved mainly with Convolutional Neural Networks (CNNs), typically following the Fully Convolutional Network (FCN) approach. One mainstream network architecture for semantic segmentation is the U-shaped network (i.e., U-Net). There is room for improvement in the existing U-Net.
Disclosure of Invention
The present invention has been made in view of the above problems. The invention provides a semantic segmentation method, a semantic segmentation device, a semantic segmentation system and a storage medium.
According to an aspect of the present invention, a semantic segmentation method is provided. The method comprises the following steps: acquiring an image to be processed; and inputting the image to be processed into a U-type network to obtain a semantic segmentation result of the image to be processed output by the U-type network, wherein the contraction path of the U-type network comprises n convolution modules connected in sequence, the output features of the i-th convolution module among the n convolution modules are combined with the output features of at least one convolution module after the i-th convolution module, and the combined features are skip-connected to the output end of the deconvolution layer corresponding to the i-th convolution module in the expansion path of the U-type network, where n is an integer greater than 1 and 1 ≤ i < n.
Illustratively, for each of the at least one convolution module after the i-th convolution module, the output features of that convolution module are input to a convolution layer, and the output features of the convolution layer are input to an upsampling layer; the output features of all the upsampling layers corresponding to the at least one convolution module after the i-th convolution module are multiplied element-wise with the output features of the i-th convolution module, and the result of the element-wise multiplication is the combined feature.
Illustratively, the method further comprises: acquiring a training image and corresponding classification label data, the classification label data being used to indicate the probability that the training image belongs to at least one predetermined class; inputting the training image into the U-type network; for each of one or more convolution modules among the n convolution modules, inputting the output features of that convolution module into the semantic supervision module corresponding to that convolution module to obtain the classification result of the training image output by that semantic supervision module; for each of the one or more convolution modules among the n convolution modules, calculating the classification loss corresponding to that convolution module based on the classification result and the classification label data; calculating a total loss based on the classification losses corresponding to the one or more convolution modules; and optimizing the U-type network based on the total loss to obtain a trained U-type network.
Illustratively, the semantic supervisor module corresponding to each of the one or more convolution modules includes two convolution layers, a global pooling layer, a fully-connected layer, and a classification function layer.
Illustratively, for each of at least one of the 2nd to (n-1)-th convolution modules among the n convolution modules, the ratio of the number of residual convolution units included in that convolution module to the total number of residual convolution units included in the n convolution modules is greater than the preset ratio corresponding to that convolution module.
Illustratively, n is equal to 5, and the 2nd to n-th convolution modules among the n convolution modules include 8, 8, 9 and 8 residual convolution units, respectively.
According to another aspect of the present invention, there is provided a semantic segmentation apparatus, including: a first acquisition module for acquiring an image to be processed; and a first input module for inputting the image to be processed into a U-type network to obtain a semantic segmentation result of the image to be processed output by the U-type network, wherein the contraction path of the U-type network comprises n convolution modules connected in sequence, the output features of the i-th convolution module among the n convolution modules are combined with the output features of at least one convolution module after the i-th convolution module, and the combined features are skip-connected to the output end of the deconvolution layer corresponding to the i-th convolution module in the expansion path of the U-type network, where n is an integer greater than 1 and 1 ≤ i < n.
According to another aspect of the present invention, there is provided a semantic segmentation system comprising a processor and a memory, wherein the memory has stored therein computer program instructions for executing the above semantic segmentation method when executed by the processor.
According to another aspect of the present invention, there is provided a storage medium having stored thereon program instructions for performing the above semantic segmentation method when executed.
Compared with the existing U-Net, the U-Net provided by the embodiment of the invention fuses, at each convolution module, the high-level features of later convolution modules, so that shallow features also carry higher semantic information. Therefore, the gap between shallow features and deep features in terms of semantic information is no longer as large as before, the shallow and deep features can be fused better, and the processing effect of the whole network can be improved. According to the semantic segmentation method, apparatus, system and storage medium disclosed by the embodiment of the invention, because the improved U-Net is adopted, a more accurate semantic segmentation result can be obtained.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent by describing in more detail embodiments of the present invention with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings, like reference numbers generally represent like parts or steps.
FIG. 1 is a schematic diagram of an existing U-Net architecture;
FIG. 2 shows segmentation results obtained with different feature levels under a conventional U-Net framework;
FIG. 3 illustrates a schematic block diagram of an example electronic device for implementing semantic segmentation methods and apparatus in accordance with embodiments of the present invention;
FIG. 4 shows a schematic flow diagram of a semantic segmentation method according to one embodiment of the invention;
FIG. 5 shows a schematic diagram of a partial network structure of U-Net according to one embodiment of the invention;
FIG. 6 shows a schematic diagram of the network structures of ResNet-50 and ResNeXt-50;
FIG. 7 illustrates a functional diagram of an SEB module according to one embodiment of the present invention;
FIG. 8 shows a schematic diagram of a partial network structure of U-Net during a training phase according to one embodiment of the invention;
FIG. 9 is a schematic diagram illustrating a network structure of an SS module according to an embodiment of the present invention;
FIG. 10 illustrates the segmentation results obtained at given feature levels for the existing U-Net and for the U-Net according to an embodiment of the present invention, respectively;
FIG. 11 shows the results of performance testing of various existing semantic segmentation networks and U-Net according to embodiments of the present invention on a PASCAL VOC 2012 validation set;
FIG. 12 shows a schematic block diagram of a semantic segmentation apparatus according to one embodiment of the present invention; and
FIG. 13 shows a schematic block diagram of a semantic segmentation system according to one embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, exemplary embodiments according to the present invention will be described in detail below with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of embodiments of the invention and not all embodiments of the invention, with the understanding that the invention is not limited to the example embodiments described herein. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the invention described herein without inventive step, shall fall within the scope of protection of the invention.
The network structure of U-Net designed by its predecessors is mainly based on the following subjective idea: by fusing features of low-resolution high-semantic information (feature maps) with features of high-resolution low-semantic information, features of high-resolution high-semantic information (the desired result) can be obtained. However, whether this idea is truly correct, and whether the fusion of these two kinds of features is truly effective, has rarely been explored.
The inventors first found that, under the existing U-Net framework, the subjective idea of fusing the features of low-resolution high-semantic information with the features of high-resolution low-semantic information has certain defects. Specifically, whether or not the first two skip connections (shortcuts) exist in U-Net has little influence on the performance of U-Net. This is explained below with reference to examples.
FIG. 1 is a schematic diagram of a conventional U-Net architecture. Referring to FIG. 1, U-Net may include a contraction path on the left and an expansion path on the right. U-Net can be understood as an encoder-decoder structure, the contraction path being the encoder and the expansion path being the decoder. The encoder gradually reduces the spatial dimension through pooling layers, and the decoder gradually restores the details and spatial dimension of the object. Since the pooling in the encoder part causes information loss, the segmentation map generated by upsampling in the decoder part is generally coarse; therefore, shortcuts can be introduced between the encoder and the decoder to compensate for the coarseness of the upsampling and to help the decoder better recover the details of the target.
In fig. 1, there are four shortcuts, shown as S1, S2, S3, and S4, respectively. It can be understood by those skilled in the art that each shortcut is a shortcut connection from the contraction path to the expansion path of the U-Net, and will not be described herein. It is to be understood that fig. 1 is only a schematic diagram in principle, and that U-Net may vary in some details in its implementation.
Fig. 2 shows the segmentation results obtained with different feature levels under the existing U-Net framework. FIG. 2 shows the segmentation results of two U-Nets whose feature extraction networks were constructed based on pre-trained ResNet-50 and ResNeXt-101 models, respectively. In FIG. 2, the U-Net segmentation results (representing performance) were evaluated on the PASCAL VOC 2012 validation set using mean Intersection over Union (mIoU). In FIG. 2, each item in the feature level column indicates which shortcuts are connected; e.g., {3,4} indicates that S3 and S4 shown in FIG. 1 are connected while S1 and S2 are not. As shown in FIG. 2, whether under the ResNet-50 model or the ResNeXt-101 model, connecting S2 increases the performance only marginally, and connecting both S1 and S2 still brings no further increase.
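For reference, the mIoU metric used in FIG. 2 (and later in FIG. 11) can be computed as in the following minimal sketch. It assumes the standard per-class Intersection-over-Union definition commonly used for PASCAL VOC benchmarks; the function name and the NumPy-based implementation are illustrative and are not part of the patent.

import numpy as np

def mean_iou(pred, gt, num_classes):
    # mean Intersection over Union across classes (standard definition, assumed)
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:          # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return float(np.mean(ious))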
The inventors speculate that the reason for the above problem is that the features of low-resolution high-semantic information and the features of high-resolution low-semantic information in U-Net differ greatly, i.e. there is a large gap, in both resolution and semantic information, and this gap is so large that the two kinds of features cannot complement each other well when fused.
Therefore, the inventor believes that introducing more semantic information into the shallow features may help to compensate for the gap, so that the two features can be better fused to achieve the desired effect. The method of introducing more semantic information into the shallow features to help better blend the features will be described in detail below.
It is to be understood that the features of high-resolution low-semantic information (i.e., shallow features) may be understood as including features output by the earlier stages of the base model (corresponding to the contraction path of U-Net), such as the features output by the two convolution modules conv2 and conv3 described below. The features of low-resolution high-semantic information (i.e., deep features) may be understood as including features output by the later stages of the base model, such as the features output by the two convolution modules conv4 and conv5 described below. In addition, since the expansion path of U-Net upsamples the output features of conv5 to gradually restore the details and spatial dimensions of the object, the features output by each network layer in the expansion path can also be regarded as features of low-resolution high-semantic information.
The embodiment of the invention provides a semantic segmentation method, apparatus, system and storage medium. According to the semantic segmentation method provided by the embodiment of the invention, a new U-Net is adopted to perform semantic segmentation on an image. The new U-Net can be obtained by improving the network structure of the existing U-Net. In the new U-Net, the output features of later convolution modules in the contraction path are combined with the output features of earlier convolution modules, and the combined features are then skip-connected to the expansion path of U-Net. In this way, deep high-semantic features are added into the shallow features, so that the shallow features and the deep features can be better fused, the performance of U-Net can be improved, and a better semantic segmentation effect can be obtained. The semantic segmentation method and apparatus provided by the embodiment of the invention can be applied to any field requiring semantic segmentation, such as geographic information systems, autonomous driving, medical image analysis and robotics.
First, an example electronic device 300 for implementing the semantic segmentation method and apparatus according to an embodiment of the present invention is described with reference to fig. 3.
As shown in fig. 3, the electronic device 300 includes one or more processors 302, one or more memory devices 304. Optionally, electronic device 300 may also include an input device 306, an output device 308, and an image capture device 310, which may be interconnected via a bus system 312 and/or other form of connection mechanism (not shown). It should be noted that the components and structure of the electronic device 300 shown in fig. 3 are exemplary only, and not limiting, and the electronic device may have other components and structures as desired.
The processor 302 may be implemented in at least one hardware form among a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA) and a microprocessor. The processor 302 may be one of, or a combination of, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), an Application Specific Integrated Circuit (ASIC), or another form of processing unit having data processing capability and/or instruction execution capability, and may control other components in the electronic device 300 to perform desired functions.
The storage 304 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. On which one or more computer program instructions may be stored that may be executed by processor 302 to implement client-side functionality (implemented by the processor) and/or other desired functionality in embodiments of the invention described below. Various applications and various data, such as various data used and/or generated by the applications, may also be stored in the computer-readable storage medium.
The input device 306 may be a device used by a user to input instructions and may include one or more of a keyboard, a mouse, a microphone, a touch screen, and the like.
The output device 308 may output various information (e.g., images and/or sounds) to an external (e.g., user), and may include one or more of a display, a speaker, etc. Alternatively, the input device 306 and the output device 308 may be integrated together, implemented using the same interactive device (e.g., a touch screen).
The image capture device 310 may capture images and store the captured images in the storage device 304 for use by other components. The image capture device 310 may be a separate camera or a camera in a mobile terminal, etc. It should be understood that image capture device 310 is merely an example, and electronic device 300 may not include image capture device 310. In this case, other devices having image capturing capabilities may be used to capture an image and transmit the captured image to the electronic device 300.
Illustratively, an exemplary electronic device for implementing the semantic segmentation method and apparatus according to embodiments of the present invention may be implemented on a device such as a personal computer or a remote server.
Hereinafter, a semantic segmentation method according to an embodiment of the present invention will be described with reference to fig. 4. FIG. 4 shows a schematic flow diagram of a semantic segmentation method 400 according to one embodiment of the invention. As shown in fig. 4, the semantic segmentation method 400 includes the following steps S410 and S420.
In step S410, an image to be processed is acquired.
The image to be processed may be any image that needs to be semantically segmented. The image to be processed can be a static image or a video frame in a video. The image to be processed may be an original image acquired by the image acquisition device, or may be an image obtained after preprocessing (such as digitizing, normalizing, smoothing, and the like) the original image.
Illustratively, the image to be processed may be extracted in the form of a tensor, and an image tensor is obtained, which may represent the image to be processed. In this case, inputting the image to be processed into U-Net may be inputting the above-described image tensor into U-Net.
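As a hedged illustration of this step, the following sketch shows how such an image tensor might be prepared and passed to the network in a PyTorch-style implementation; the function name, the normalization constants and the assumption that the U-type network is available as a torch.nn.Module producing per-pixel class scores are all illustrative and not specified by the patent.

import torch
import torchvision.transforms as T
from PIL import Image

def segment_image(unet, image_path):
    # `unet` is assumed to be the trained U-type network (a torch.nn.Module)
    preprocess = T.Compose([
        T.ToTensor(),                                  # HxWxC uint8 -> CxHxW float in [0, 1]
        T.Normalize(mean=[0.485, 0.456, 0.406],
                    std=[0.229, 0.224, 0.225]),        # assumed ImageNet statistics
    ])
    image = Image.open(image_path).convert("RGB")
    image_tensor = preprocess(image).unsqueeze(0)      # add a batch dimension: 1xCxHxW
    with torch.no_grad():
        seg_logits = unet(image_tensor)                # 1 x num_classes x H x W scores
        return seg_logits.argmax(dim=1)                # per-pixel class labels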
In step S420, the image to be processed is input into a U-type network to obtain a semantic segmentation result of the image to be processed output by the U-type network, wherein the contraction path of the U-type network includes n convolution modules connected in sequence, the output features of the i-th convolution module among the n convolution modules are combined with the output features of at least one convolution module after the i-th convolution module, and the combined features are skip-connected to the output end of the deconvolution layer corresponding to the i-th convolution module in the expansion path of the U-type network, where n is an integer greater than 1 and 1 ≤ i < n.
The overall network structure of the U-Net according to the embodiment of the present invention can refer to the network structure of the existing U-Net shown in FIG. 1. Illustratively, apart from the fact that the output features of some later convolution modules of the contraction path are combined with the output features of earlier convolution modules, the remaining network structure of the U-Net according to embodiments of the present invention can be consistent with the existing U-Net.
Fig. 5 shows a schematic diagram of a partial network structure of U-Net according to one embodiment of the invention. As shown in FIG. 5, the contraction path of U-Net according to an embodiment of the present invention (i.e., the part belonging to the base model) can include 5 convolution modules, denoted conv-1, res-2, res-3, res-4 and res-5, respectively. Illustratively, the 5 convolution modules of U-Net may be implemented using a network structure consistent with the 5 convolution modules conv1, conv2, conv3, conv4, conv5 of a ResNet model (e.g., ResNet-50, ResNet-101, etc.) or a ResNeXt model (e.g., ResNeXt-50, ResNeXt-101, etc.). FIG. 6 shows a schematic diagram of the network structures of ResNet-50 and ResNeXt-50. Those skilled in the art can understand, with reference to FIG. 6, the network structures of the 5 convolution modules of the ResNet model and of the ResNeXt model, and can thereby understand the network structure of the 5 convolution modules of U-Net according to the embodiment of the present invention.
The manner in which the output features of later convolution modules are combined with the output features of earlier convolution modules is described below. Taking res-2 as an example, the output features of res-2, res-3, res-4 and res-5 can be connected together to the Semantic Embedding Branch (SEB) module corresponding to res-2. In the SEB module, the output features of res-2, res-3, res-4 and res-5 are combined. The combined features are connected to the expansion path of U-Net in the shortcut manner, and the connection node is consistent with the connection node of res-2 in the existing U-Net. The shortcuts from the output features of res-2, res-3, res-4 and res-5 to the expansion path correspond to S1, S2, S3 and S4, respectively, described above with reference to FIG. 1.
Illustratively, for each of the at least one convolution module after the i-th convolution module, the output features of that convolution module are input to a convolution layer, and the output features of the convolution layer are input to an upsampling layer; the output features of all the upsampling layers corresponding to the at least one convolution module after the i-th convolution module are multiplied element-wise with the output features of the i-th convolution module, and the result of the element-wise multiplication is the combined feature.
The combination manner in the SEB module may be such that, after the output features of at least one convolution module after the ith convolution module are convolved and upsampled, the obtained features are element-wise multiplied with the output features of the ith convolution module to obtain combined features.
FIG. 7 shows a functional schematic diagram of an SEB module according to an embodiment of the present invention. In FIG. 7, the "x" symbol indicates an element-wise multiplication operation. As shown in FIG. 7, in the SEB module corresponding to the i-th convolution module, a feature map output by a convolution module after the i-th convolution module (i.e., a high-level feature map) may undergo a 3 × 3 convolution, and a bilinear upsampling process may then be performed on the convolution result to obtain a new feature map. Subsequently, the feature map output by the i-th convolution module (i.e., the low-level feature map) may be multiplied element-wise with the new feature map to obtain the combined features. FIG. 7 only shows an exemplary combination of one group of high-level features with one group of low-level features; if there is more than one group of high-level features, each group of high-level features may undergo its own convolution and upsampling, and finally the converted results of the multiple groups of high-level features and the low-level features are multiplied element-wise to obtain the combined features.
By performing convolution and upsampling on each set of high-level features, each set of high-level features can be converted to be consistent with the size of the low-level features to be combined, i.e., each set of high-level features has the same width, height and channel number as the low-level features, so as to be convenient to combine.
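The following is a minimal PyTorch-style sketch of one possible SEB implementation consistent with the description above (a 3x3 convolution and bilinear upsampling per group of high-level features, followed by element-wise multiplication with the low-level features); the class name, channel arguments and interpolation settings are illustrative assumptions, not the patent's definitive implementation.

import torch.nn as nn
import torch.nn.functional as F

class SEB(nn.Module):
    # Semantic Embedding Branch: combine one low-level feature map with one or
    # more high-level feature maps, as described above.
    def __init__(self, low_channels, high_channels_list):
        super().__init__()
        # one 3x3 convolution per group of high-level features, mapping it to
        # the channel count of the low-level features so the sizes match
        self.convs = nn.ModuleList(
            nn.Conv2d(c, low_channels, kernel_size=3, padding=1)
            for c in high_channels_list
        )

    def forward(self, low_feat, high_feats):
        combined = low_feat
        for conv, high in zip(self.convs, high_feats):
            x = conv(high)                                    # 3x3 convolution
            x = F.interpolate(x, size=low_feat.shape[2:],     # bilinear upsampling
                              mode="bilinear", align_corners=False)
            combined = combined * x                           # element-wise multiplication
        return combined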
Suppose that the output feature of the l-th convolution module is x_l, and that the combined feature obtained by combining the output feature of the l-th convolution module with the output features of at least one convolution module after the l-th convolution module is x̂_l. Then the combination of the combined feature with the output feature of the deconvolution layer corresponding to the l-th convolution module can be expressed as follows:
y_l = x̂_l + upsample(y_{l+1})
where y_l is the combined result corresponding to the l-th convolution module, and upsample(y_{l+1}) is the result obtained by upsampling (for example, by convolution and deconvolution) the combined result corresponding to the (l+1)-th convolution module.
The expansion path of U-Net comprises a plurality of deconvolution layers. For example, in U-Net according to an embodiment of the present invention, the deconvolution layer corresponding to the l-th convolution module is the layer whose output features are input, together with the combined features corresponding to the l-th convolution module, to the combining module to be combined. In the deconvolution layer corresponding to the l-th convolution module, deconvolution may be performed on the combined result corresponding to the (l+1)-th convolution module to obtain the output features of that deconvolution layer.
It is understood that the combining operation in U-Net according to embodiments of the present invention can be understood as a continuously iterated process. For example, the combined features corresponding to the l-th convolution module are combined with the deconvolution result of the combined result corresponding to the (l+1)-th convolution module to obtain the combined result corresponding to the l-th convolution module; the combined features corresponding to the (l-1)-th convolution module are then combined with the deconvolution result of the combined result corresponding to the l-th convolution module to obtain the combined result corresponding to the (l-1)-th convolution module; and so on.
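A minimal sketch of this iterative combining process is given below. It assumes, for illustration only, that the per-module combined features x̂_l are stored in a container, that the deconvolution layers of the expansion path are available as callable modules, and that the combination is the addition written in the formula above; none of these names come from the patent.

def expand_path(combined_feats, deconv_layers, y_deepest, n):
    # combined_feats[l] : combined feature x_hat_l for the l-th convolution module (l = 1 .. n-1)
    # deconv_layers[l]  : deconvolution layer corresponding to the l-th convolution module
    # y_deepest         : output of the deepest stage of the contraction path
    y = y_deepest
    for l in range(n - 1, 0, -1):                      # iterate l = n-1, ..., 1
        y = combined_feats[l] + deconv_layers[l](y)    # y_l = x_hat_l + upsample(y_{l+1})
    return y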
It is to be understood that the features described herein may be feature maps (feature maps).
The number of later convolution modules whose features are combined at each convolution module of U-Net according to the embodiment of the present invention can be set as needed, and the present invention does not limit this. For example, the output features of res-2 may be combined with the output features of one or more of res-3, res-4 and res-5 to obtain combined features, which are then connected to the expansion path in the shortcut manner. As another example, the output features of res-3 may be combined with the output features of one or more of res-4 and res-5 to obtain combined features, which are then connected to the expansion path in the shortcut manner. As another example, the output features of res-4 may be combined with the output features of res-5 to obtain combined features, which are then connected to the expansion path in the shortcut manner. In U-Net, the 1st convolution module conv-1 has few layers and mainly performs simple convolution on the input image, so a shortcut may optionally not be set there. In this case, 2 ≤ i < n, that is, the 1st convolution module may be skipped, and the above feature combination is performed at one or more of the 2nd to (n-1)-th convolution modules.
Compared with the existing U-Net, the U-Net according to the embodiment of the invention fuses, at each convolution module (or stage), the high-level features of later convolution modules, so that shallow features also carry higher semantic information. Therefore, the gap between shallow features and deep features in terms of semantic information is no longer as large as before, the shallow and deep features can be fused better, and the processing effect of the whole network can be improved. According to the semantic segmentation method provided by the embodiment of the invention, because the improved U-Net is adopted, a more accurate semantic segmentation result can be obtained.
Illustratively, the semantic segmentation method according to embodiments of the present invention may be implemented in a device, apparatus or system having a memory and a processor.
The semantic segmentation method can be deployed at an image acquisition end, for example, the semantic segmentation method can be deployed at the image acquisition end of an access control system in the field of security application; in the field of financial applications, it may be deployed at personal terminals such as smart phones, tablets, personal computers, and the like.
Alternatively, the semantic segmentation method according to the embodiment of the present invention may also be distributively deployed at a server side (or a cloud side) and a personal terminal side. For example, an image may be acquired at a client, and the client transmits the acquired image to a server (or a cloud), and the server (or the cloud) performs semantic segmentation.
According to the embodiment of the present invention, the semantic segmentation method 400 may further include: acquiring a training image and corresponding classification label data, the classification label data being used to indicate the probability that the training image belongs to at least one predetermined class; inputting the training image into the U-type network; for each of one or more convolution modules among the n convolution modules, inputting the output features of that convolution module into the semantic supervision module corresponding to that convolution module to obtain the classification result of the training image output by that semantic supervision module; for each of the one or more convolution modules among the n convolution modules, calculating the classification loss corresponding to that convolution module based on the classification result and the classification label data; calculating a total loss based on the classification losses corresponding to the one or more convolution modules; and optimizing the U-type network based on the total loss to obtain a trained U-type network.
Those skilled in the art will appreciate that the contraction path of U-Net can be used for classification functions, and together with the expansion path, can be used for semantic segmentation functions. In training U-Net, the systolic path (i.e., the underlying model for classification) of U-Net can be pre-trained first, e.g., pre-trained for five convolution modules, conv-1, res-2, res-3, res-4, and res-5. After the contraction path of the U-Net is trained, the pre-trained network parameters can be used as initialization parameters, and the expansion path can be added to train the semantic segmentation task for the whole U-Net.
When the base model for classification is pre-trained, features close to the final loss signal have stronger semantic information. A traditional classification network adds a loss function only at the very end, and some visualization work has shown that the features of later layers are characterized more by high-semantic information, while the features of earlier layers are characterized more by low-semantic edges and corners. For the five convolution modules conv-1, res-2, res-3, res-4 and res-5 of U-Net, the later the convolution module, the stronger the semantic information of its output features, and the relatively lower their resolution.
Based on the above theory, an extra loss can be added, when training the base model, at the places where the earlier convolution modules output their features, so as to force higher semantic information to be embedded into the earlier-layer features.
Fig. 8 shows a schematic diagram of a partial network structure of U-Net in a training phase according to an embodiment of the invention. As shown in fig. 8, for the four convolution modules res-2 to res-5, a Semantic Supervision (SS) module may be added after each convolution module.
Illustratively, the semantic supervisor module corresponding to each of the one or more convolution modules may include two convolution layers, a global pooling layer, a fully-connected layer, and a classification function layer.
Fig. 9 shows a schematic diagram of a network structure of an SS module according to one embodiment of the present invention. As shown in fig. 9, the feature map output by the convolution module may be input to the corresponding SS module. In the SS module, two 3 × 3 convolutions may be performed in sequence, followed by global pooling and a fully connected layer; the features output by the fully connected layer may then be input to the classification function layer (not shown in fig. 9), and finally the output of the classification function layer is connected to the classification loss function to calculate the classification loss. Illustratively, the classification function layer may be a softmax layer. Illustratively, the classification loss function employed to calculate the classification loss may be a cross-entropy loss function.
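The following PyTorch-style sketch shows one possible form of the SS module consistent with FIG. 9 and with the structure listed above (two 3x3 convolution layers, a global pooling layer, a fully connected layer and a classification function layer realized through the loss); the intermediate channel count and the ReLU activations are illustrative assumptions not stated in the text.

import torch.nn as nn

class SemanticSupervision(nn.Module):
    # auxiliary classification head attached to a convolution module during pre-training
    def __init__(self, in_channels, mid_channels, num_classes):
        super().__init__()
        self.convs = nn.Sequential(                       # two 3x3 convolution layers
            nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)               # global pooling layer
        self.fc = nn.Linear(mid_channels, num_classes)    # fully connected layer

    def forward(self, feature_map):
        x = self.convs(feature_map)
        x = self.pool(x).flatten(1)
        return self.fc(x)   # logits; softmax is applied inside the cross-entropy loss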
During training, the base model can be trained on the ImageNet image classification task, that is, the training images can be selected from the ImageNet image dataset. The classification function layer after each convolution module may output the probabilities that the training image belongs to the respective predetermined classes. The classification label data (ground truth) is used to indicate the class to which the training image actually belongs, and may be a one-hot vector. For example, in the classification label data, the element corresponding to the class to which the training image belongs may be set to 1, and the remaining elements may be set to 0.
Those skilled in the art can understand the manner of calculating the classification loss based on the classification result and the classification label data of each SS module, which is not described herein in detail. Illustratively, the classification losses corresponding to one or more convolution modules may be averaged to obtain an overall loss. For example, if an SS module is connected after all four convolution modules, res-2, res-3, res-4, and res-5, then four classification losses can be calculated and averaged to obtain the total loss.
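As a hedged illustration, the averaging described above can be computed as in the following sketch, where ss_logits is assumed to be a list holding the classification outputs of the SS modules attached to the supervised convolution modules (e.g., res-2 to res-5) and labels holds the class indices of the training batch; the function name is illustrative.

import torch
import torch.nn.functional as F

def total_classification_loss(ss_logits, labels):
    # one cross-entropy classification loss per supervised convolution module
    losses = [F.cross_entropy(logits, labels) for logits in ss_logits]
    return torch.stack(losses).mean()     # average to obtain the total loss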
Subsequently, the parameters of U-Net can be updated to minimize the total loss until convergence, and finally a trained U-Net can be obtained.
It should be noted that the SS module is mainly used in the training phase of U-Net, and the SS module connected after each convolution module can be eliminated in the practical application phase of U-Net.
The network layers included in the SS module connected after each convolution module form a shallow network, that is, the number of layers is small; for example, there are only one or two convolution layers, and adding the subsequent pooling, fully connected and softmax layers still keeps the total number of layers small. Since a classification result is obtained right after each SS module, after training the features output by the convolution module connected to the SS module are closer to the final classification result, that is, the semantic information contained in the features output by that convolution module is stronger. Therefore, this approach can improve the semantic information of the features output by the convolution modules connected to SS modules, further enhancing the semantic information of shallow features and further reducing the gap between earlier-layer and later-layer features.
The number of convolution modules connected to SS modules during training can be set as required; for example, it can be one or more of the five convolution modules conv-1, res-2, res-3, res-4 and res-5. As described above, the 1st convolution module conv-1 has few layers and mainly performs simple convolution on the input image, so it is far from the final classification result; the 1st convolution module can therefore be skipped as required, and one or more convolution modules among the 2nd to n-th convolution modules can be selected to be connected to SS modules for training.
In addition, the structure of the SS module shown in fig. 9 is only an example and not a limitation of the present invention, and the SS module may have other suitable structures as long as it can implement the classification function. Preferably, the SS module includes fewer network layers, for example, it may include no more than 5 convolutional layers, so that the total network layers are fewer, and thus, the features output by the corresponding convolutional module are closer to the final classification result, so that the semantic information included in the features output by the convolutional module is stronger.
According to the embodiment of the invention, for each of at least one of the 2nd to (n-1)-th convolution modules among the n convolution modules, the ratio of the number of residual convolution units included in that convolution module to the total number of residual convolution units included in the n convolution modules is greater than the preset ratio corresponding to that convolution module.
In the case of building U-Net based on a known residual network (e.g., ResNet or ResNeXt), the numbers of residual convolution units (blocks) included in the convolution modules of the known residual network may be rearranged such that the number of blocks included in each of at least one of the 2nd to (n-1)-th convolution modules increases. Through this rearrangement, for each of the 2nd to (n-1)-th convolution modules of the known residual network, the ratio of the number of blocks included in that convolution module to the total number of blocks included in the n convolution modules is greater than the preset ratio corresponding to that convolution module.
In the conventional ResNet or ResNeXt model, the four convolution modules conv2, conv3, conv4 and conv5 each include several blocks. Referring back to FIG. 6, the parameters of the respective network structures are shown below the two columns ResNet-50 and ResNeXt-50. In FIG. 6, the parameters inside the square brackets are the parameters of a single block, and the number outside the square brackets is the number of blocks included in each convolution module. As shown in FIG. 6, in the conventional ResNet-50 model and ResNeXt-50 model, the numbers of blocks included in the four convolution modules conv2, conv3, conv4 and conv5 are 3, 4, 6 and 3.
The blocks included in the convolution modules of a traditional ResNet model or ResNeXt model can be rearranged so as to improve the semantic information of shallow features. After the rearrangement, the proportion of the number of blocks included in the earlier convolution modules relative to the total number of blocks is increased, for example, increased to be larger than the corresponding preset ratio. For example, in the case where U-Net is constructed based on the ResNet-50 model or the ResNeXt-50 model, the preset ratio corresponding to conv2 may be 20% or more, the preset ratio corresponding to conv3 may be 25% or more, and the preset ratio corresponding to conv4 may be 37.5% or more. As another example, in the case where U-Net is constructed based on the ResNet-101 model or the ResNeXt-101 model, the preset ratio corresponding to conv2 may be 10% or more, the preset ratio corresponding to conv3 may be 10% or more, and the preset ratio corresponding to conv4 may be 70% or more.
Illustratively, n is equal to 5, and the 2nd to n-th convolution modules among the n convolution modules include 8, 8, 9 and 8 residual convolution units, respectively.
For example, in the conventional ResNet-101 or ResNeXt-101 model, the numbers of blocks included in the conv2, conv3, conv4 and conv5 convolution modules are 3, 4, 23 and 3, a distribution considered beneficial for the classification task. In order to increase the semantic information of the earlier-layer features and give the points where the earlier shortcuts tap off features a stronger semantic expression capability, the block distribution can be rearranged into 8, 8, 9 and 8. In this way, the positions where the earlier layers lead out features have stronger semantic representation ability, and features with stronger semantic information can be obtained.
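The effect of this rearrangement on the preset ratios discussed above can be checked with a small computation such as the following sketch (the helper name is illustrative): with the original ResNet-101/ResNeXt-101 distribution most blocks sit in conv4, while the rearranged distribution spreads them more evenly toward the earlier stages.

def block_ratios(blocks_per_stage):
    total = sum(blocks_per_stage)
    return [round(b / total, 2) for b in blocks_per_stage]

print(block_ratios((3, 4, 23, 3)))   # original conv2..conv5:   [0.09, 0.12, 0.7, 0.09]
print(block_ratios((8, 8, 9, 8)))    # rearranged res-2..res-5: [0.24, 0.24, 0.27, 0.24]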
In the case of building U-Net based on a known residual network (e.g., ResNet or ResNeXt), it may be desirable to preferentially increase the proportion of blocks included in the earlier convolution modules.
In one example, the numbers of blocks included in the convolution modules of the known residual network may be rearranged such that the number of blocks included in the 2nd convolution module is increased. For example, the number of blocks included in conv2 may be increased while the numbers of blocks of the remaining convolution modules are kept unchanged or decreased. In this case, the ratio between the number of blocks included in the 2nd convolution module of the n convolution modules and the total number of blocks included in the n convolution modules is greater than the corresponding preset ratio.
In another example, the numbers of blocks included in the convolution modules of the known residual network may be rearranged such that the numbers of blocks included in the 2nd and 3rd convolution modules are increased. For example, the numbers of blocks included in conv2 and conv3 may be increased at the same time while the numbers of blocks of the remaining convolution modules are kept unchanged or decreased. In this case, for each of the 2nd and 3rd convolution modules of the n convolution modules, the ratio between the number of blocks included in that convolution module and the total number of blocks included in the n convolution modules is greater than the preset ratio corresponding to that convolution module.
In U-Net according to an embodiment of the present invention, different blocks included in the same convolution module may share the same network parameters (e.g., convolution kernel size), while the blocks included in different convolution modules may have different parameters. In U-Net according to an embodiment of the present invention, a given convolution module may perform downsampling, i.e., a reduction of the feature-map resolution, compared with the adjacent preceding convolution module. Illustratively, the blocks in each convolution module of U-Net according to embodiments of the present invention may use network parameters consistent with the blocks of the corresponding convolution module in the existing U-Net. For example, following the above example, the block distribution of the four convolution modules res-2, res-3, res-4 and res-5 in U-Net according to the embodiment of the present invention may be rearranged into 8, 8, 9 and 8, where the network parameters of the 8 blocks included in res-2 may be consistent with the network parameters of the 3 blocks included in conv2 of the existing ResNet-101 or ResNeXt-101 model.
Practice proves that the method provided by the embodiment of the invention enables shallow features and deep features to be fused better. Fig. 10 shows the segmentation results obtained at given feature levels for the existing U-Net and for the U-Net according to an embodiment of the invention (denoted ExFuse), respectively. Both the conventional U-Net shown in FIG. 10 and the U-Net according to the embodiment of the present invention are constructed based on ResNeXt-101. As shown in fig. 10, for U-Net according to the embodiment of the present invention, connecting the first two shortcuts brings a 1.3-point improvement. The better feature fusion also enables the U-Net model provided by the embodiment of the invention to obtain a large performance improvement on benchmark datasets. Fig. 11 shows the results of performance tests on the PASCAL VOC 2012 validation set for various existing semantic segmentation networks and for U-Net according to an embodiment of the present invention (denoted ExFuse). In fig. 11, the performance of each network is measured by mIoU. As shown in FIG. 11, the U-Net according to the embodiment of the present invention reaches 86.8% mIoU, exceeding the other prior methods shown in FIG. 11 and reaching the leading level in the art.
According to another aspect of the present invention, a semantic segmentation apparatus is provided. FIG. 12 shows a schematic block diagram of a semantic segmentation apparatus 1200 according to one embodiment of the present invention.
As shown in fig. 12, the semantic segmentation apparatus 1200 according to the embodiment of the present invention includes a first obtaining module 1210 and a first input module 1220. The various modules may perform the various steps/functions of the semantic segmentation method described above in connection with fig. 3-11, respectively. Only the main functions of the components of the semantic segmentation apparatus 1200 will be described below, and details that have been described above will be omitted.
The first obtaining module 1210 is used for obtaining an image to be processed. The first obtaining module 1210 may be implemented by the processor 302 in the electronic device shown in fig. 3 executing program instructions stored in the storage 304.
The first input module 1220 is configured to input the image to be processed into a U-type network to obtain a semantic segmentation result of the image to be processed output by the U-type network, wherein the contraction path of the U-type network includes n convolution modules connected in sequence, the output features of the i-th convolution module among the n convolution modules are combined with the output features of at least one convolution module after the i-th convolution module, and the combined features are skip-connected to the output end of the deconvolution layer corresponding to the i-th convolution module in the expansion path of the U-type network, where n is an integer greater than 1 and 1 ≤ i < n. The first input module 1220 may be implemented by the processor 302 in the electronic device shown in fig. 3 executing program instructions stored in the storage 304.
Illustratively, for each of the at least one convolution module after the i-th convolution module, the output features of that convolution module are input to a convolution layer, and the output features of the convolution layer are input to an upsampling layer; the output features of all the upsampling layers corresponding to the at least one convolution module after the i-th convolution module are multiplied element-wise with the output features of the i-th convolution module, and the result of the element-wise multiplication is the combined feature.
Exemplarily, the semantic segmentation apparatus 1200 further includes: a second obtaining module (not shown) for obtaining the training images and corresponding classification label data, the classification label data being used for indicating the probability that the training images belong to at least one predetermined category; a second input module (not shown) for inputting the training images into the U-type network; a third input module (not shown) for inputting, for each of one or more convolution modules of the n convolution modules, the output characteristics of the convolution module into the semantic supervisor module corresponding to the convolution module to obtain the classification result of the training image output by the semantic supervisor module; a first calculating module (not shown) for calculating, for each of one or more of the n convolution modules, a classification loss corresponding to the convolution module based on the classification result and the classification label data; a second calculation module (not shown) for calculating a total loss based on the classification losses corresponding to the one or more convolution modules; an optimization module (not shown) for optimizing the U-type network based on the total loss to obtain a trained U-type network.
Illustratively, the semantic supervisor module corresponding to each of the one or more convolution modules includes two convolution layers, a global pooling layer, a fully-connected layer, and a classification function layer.
Illustratively, for each of at least one of the 2nd to (n-1)-th convolution modules among the n convolution modules, the ratio of the number of residual convolution units included in that convolution module to the total number of residual convolution units included in the n convolution modules is greater than the preset ratio corresponding to that convolution module.
Illustratively, n is equal to 5, and the 2nd to n-th convolution modules among the n convolution modules include 8, 8, 9 and 8 residual convolution units, respectively.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
FIG. 13 shows a schematic block diagram of a semantic segmentation system 1300 in accordance with one embodiment of the present invention. The semantic segmentation system 1300 includes an image capture device 1310, a storage device (i.e., memory) 1320, and a processor 1330.
The image capture device 1310 is used to capture an image. The image capture device 1310 is optional, and the semantic segmentation system 1300 may not include it; in that case, another image capture device may be used to capture an image and send the captured image to the semantic segmentation system 1300.
The storage device 1320 stores computer program instructions for implementing the corresponding steps of the semantic segmentation method according to an embodiment of the present invention.
The processor 1330 is configured to execute the computer program instructions stored in the storage device 1320 to perform the corresponding steps of the semantic segmentation method according to an embodiment of the present invention.
In one embodiment, the computer program instructions, when executed by the processor 1330, are used to perform the following steps: acquiring an image to be processed; and inputting the image to be processed into a U-type network to obtain a semantic segmentation result of the image to be processed output by the U-type network, where a contraction path of the U-type network includes n convolution modules connected in sequence, the output feature of the i-th convolution module of the n convolution modules is combined with the output feature of at least one convolution module after the i-th convolution module, and the combined feature is skip-connected to the output of the deconvolution layer corresponding to the i-th convolution module in an expansion path of the U-type network, where n is an integer greater than 1, and i is greater than or equal to 1 and less than n.
Illustratively, for each of the at least one convolution module after the i-th convolution module, the output feature of that convolution module is input to a convolution layer, the output feature of the convolution layer is input to an upsampling layer, the output features of all upsampling layers corresponding to the at least one convolution module after the i-th convolution module are multiplied element-wise with the output feature of the i-th convolution module, and the result of the element-wise multiplication is the combined feature.
Illustratively, the computer program instructions, when executed by the processor 1330, are further used to perform the following steps: acquiring a training image and corresponding classification label data, the classification label data indicating the probability that the training image belongs to at least one predetermined category; inputting the training image into the U-type network; for each of one or more of the n convolution modules, inputting the output feature of that convolution module into the semantic supervision module corresponding to that convolution module to obtain the classification result of the training image output by the semantic supervision module; for each of the one or more convolution modules, calculating the classification loss corresponding to that convolution module based on the classification result and the classification label data; calculating a total loss based on the classification losses corresponding to the one or more convolution modules; and optimizing the U-type network based on the total loss to obtain a trained U-type network.
Illustratively, the semantic supervision module corresponding to each of the one or more convolution modules includes two convolution layers, a global pooling layer, a fully-connected layer, and a classification function layer.
Illustratively, for each of at least one of the 2nd to (n-1)-th convolution modules, the ratio between the number of residual convolution units included in that convolution module and the total number of residual convolution units included in the n convolution modules is greater than the preset ratio corresponding to that convolution module.
Illustratively, n is equal to 5, and the 2nd to n-th convolution modules of the n convolution modules include 8, 8, 9, and 8 residual convolution units, respectively.
Furthermore, according to an embodiment of the present invention, a storage medium is also provided, on which program instructions are stored; when executed by a computer or a processor, the program instructions are used to execute the corresponding steps of the semantic segmentation method according to an embodiment of the present invention and to implement the corresponding modules of the semantic segmentation apparatus according to an embodiment of the present invention. The storage medium may include, for example, a memory card of a smart phone, a storage component of a tablet computer, a hard disk of a personal computer, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disc read-only memory (CD-ROM), a USB memory, or any combination of the above storage media.
In one embodiment, the program instructions, when executed by a computer or a processor, may cause the computer or the processor to implement the respective functional modules of the semantic segmentation apparatus according to an embodiment of the present invention and/or to perform the semantic segmentation method according to an embodiment of the present invention.
In one embodiment, the program instructions, when executed, are used to perform the following steps: acquiring an image to be processed; and inputting the image to be processed into a U-type network to obtain a semantic segmentation result of the image to be processed output by the U-type network, where a contraction path of the U-type network includes n convolution modules connected in sequence, the output feature of the i-th convolution module of the n convolution modules is combined with the output feature of at least one convolution module after the i-th convolution module, and the combined feature is skip-connected to the output of the deconvolution layer corresponding to the i-th convolution module in an expansion path of the U-type network, where n is an integer greater than 1, and i is greater than or equal to 1 and less than n.
Illustratively, for each of the at least one convolution module after the i-th convolution module, the output feature of that convolution module is input to a convolution layer, the output feature of the convolution layer is input to an upsampling layer, the output features of all upsampling layers corresponding to the at least one convolution module after the i-th convolution module are multiplied element-wise with the output feature of the i-th convolution module, and the result of the element-wise multiplication is the combined feature.
Illustratively, the program instructions, when executed, are further used to perform the following steps: acquiring a training image and corresponding classification label data, the classification label data indicating the probability that the training image belongs to at least one predetermined category; inputting the training image into the U-type network; for each of one or more of the n convolution modules, inputting the output feature of that convolution module into the semantic supervision module corresponding to that convolution module to obtain the classification result of the training image output by the semantic supervision module; for each of the one or more convolution modules, calculating the classification loss corresponding to that convolution module based on the classification result and the classification label data; calculating a total loss based on the classification losses corresponding to the one or more convolution modules; and optimizing the U-type network based on the total loss to obtain a trained U-type network.
Illustratively, the semantic supervision module corresponding to each of the one or more convolution modules includes two convolution layers, a global pooling layer, a fully-connected layer, and a classification function layer.
Illustratively, for each of at least one of the 2nd to (n-1)-th convolution modules, the ratio between the number of residual convolution units included in that convolution module and the total number of residual convolution units included in the n convolution modules is greater than the preset ratio corresponding to that convolution module.
Illustratively, n is equal to 5, and the 2nd to n-th convolution modules of the n convolution modules include 8, 8, 9, and 8 residual convolution units, respectively.
The modules in the semantic segmentation system according to an embodiment of the present invention may be implemented by a processor of an electronic device implementing semantic segmentation according to an embodiment of the present invention running computer program instructions stored in a memory, or may be implemented when computer instructions stored in a computer-readable storage medium of a computer program product according to an embodiment of the present invention are run by a computer.
Although the illustrative embodiments have been described herein with reference to the accompanying drawings, it is to be understood that the foregoing illustrative embodiments are merely exemplary and are not intended to limit the scope of the invention thereto. Various changes and modifications may be effected therein by one of ordinary skill in the pertinent art without departing from the scope or spirit of the present invention. All such changes and modifications are intended to be included within the scope of the present invention as set forth in the appended claims.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the units is only one logical functional division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another device, or some features may be omitted, or not executed.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the invention and aiding in the understanding of one or more of the various inventive aspects. However, the method of the present invention should not be construed to reflect the intent: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
It will be understood by those skilled in the art that all of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where such features are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functionality of some of the modules in a semantic segmentation apparatus according to embodiments of the present invention. The present invention may also be embodied as apparatus programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any ordering; these words may be interpreted as names.
The above description is merely illustrative of specific embodiments of the present invention, and the protection scope of the present invention is not limited thereto. Any person skilled in the art can readily conceive of changes or substitutions within the technical scope disclosed by the present invention, and such changes or substitutions shall be covered within the protection scope of the present invention. The protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (9)

1. A method of semantic segmentation, comprising:
acquiring an image to be processed;
inputting the image to be processed into a U-type network to obtain a semantic segmentation result of the image to be processed output by the U-type network, wherein a contraction path of the U-type network comprises n convolution modules connected in sequence, the output feature of the i-th convolution module of the n convolution modules is combined with the output feature of at least one convolution module after the i-th convolution module, and the combined feature is skip-connected to the output of the deconvolution layer corresponding to the i-th convolution module in an expansion path of the U-type network, wherein n is an integer greater than 1, and i is greater than or equal to 1 and less than or equal to n.
2. The method of claim 1, wherein, for each of the at least one convolution module following the i-th convolution module, the output feature of that convolution module is input to a convolution layer, the output feature of the convolution layer is input to an upsampling layer, the output features of all upsampling layers corresponding to the at least one convolution module following the i-th convolution module are multiplied element-wise with the output feature of the i-th convolution module, and the result of the element-wise multiplication is the combined feature.
3. The method of claim 1, wherein the method further comprises:
acquiring a training image and corresponding classification label data, wherein the classification label data is used to indicate the probability that the training image belongs to at least one predetermined category;
inputting the training image into the U-type network;
for each of one or more of the n convolution modules,
inputting the output feature of the convolution module into a semantic supervision module corresponding to the convolution module to obtain the classification result of the training image output by the semantic supervision module;
calculating the classification loss corresponding to the convolution module based on the classification result and the classification label data;
calculating a total loss based on the classification losses corresponding to the one or more convolution modules;
optimizing the U-type network based on the total loss to obtain a trained U-type network.
4. The method of claim 3, wherein the semantic supervision module corresponding to each of the one or more convolution modules comprises two convolution layers, a global pooling layer, a fully-connected layer, and a classification function layer.
5. The method of claim 1, wherein, for each of at least one of the 2nd to (n-1)-th convolution modules, a ratio between the number of residual convolution units included in the convolution module and the total number of residual convolution units included in the n convolution modules is greater than a preset ratio corresponding to the convolution module.
6. The method of claim 1, wherein n is equal to 5, and the 2nd to n-th convolution modules of the n convolution modules include 8, 8, 9, and 8 residual convolution units, respectively.
7. A semantic segmentation apparatus comprising:
the first acquisition module is used for acquiring an image to be processed;
the first input module is used for inputting the image to be processed into a U-type network to obtain a semantic segmentation result of the image to be processed output by the U-type network, wherein a contraction path of the U-type network comprises n convolution modules connected in sequence, the output feature of the i-th convolution module of the n convolution modules is combined with the output feature of at least one convolution module after the i-th convolution module, and the combined feature is skip-connected to the output of the deconvolution layer corresponding to the i-th convolution module in an expansion path of the U-type network, wherein n is an integer greater than 1, and i is greater than or equal to 1 and less than or equal to n.
8. A semantic segmentation system comprising a processor and a memory, wherein the memory has stored therein computer program instructions for execution by the processor for performing the semantic segmentation method according to any one of claims 1 to 6.
9. A storage medium on which are stored program instructions for performing, when executed, the semantic segmentation method according to any one of claims 1 to 6.
CN201810333056.7A 2018-04-13 2018-04-13 Semantic segmentation method, device and system and storage medium Active CN108876792B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810333056.7A CN108876792B (en) 2018-04-13 2018-04-13 Semantic segmentation method, device and system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810333056.7A CN108876792B (en) 2018-04-13 2018-04-13 Semantic segmentation method, device and system and storage medium

Publications (2)

Publication Number Publication Date
CN108876792A CN108876792A (en) 2018-11-23
CN108876792B true CN108876792B (en) 2020-11-10

Family

ID=64326278

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810333056.7A Active CN108876792B (en) 2018-04-13 2018-04-13 Semantic segmentation method, device and system and storage medium

Country Status (1)

Country Link
CN (1) CN108876792B (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008808B (en) * 2018-12-29 2021-04-09 北京迈格威科技有限公司 Panorama segmentation method, device and system and storage medium
CN109740688B (en) * 2019-01-09 2023-04-18 广东工业大学 Terahertz image information interpretation method, network and storage medium
CN110322435A (en) * 2019-01-20 2019-10-11 北京工业大学 A kind of gastric cancer pathological image cancerous region dividing method based on deep learning
CN109949352A (en) * 2019-03-22 2019-06-28 邃蓝智能科技(上海)有限公司 A kind of radiotherapy image Target delineations method based on deep learning and delineate system
CN110619334B (en) * 2019-09-16 2022-09-06 Oppo广东移动通信有限公司 Portrait segmentation method based on deep learning, architecture and related device
CN111783779B (en) * 2019-09-17 2023-12-05 北京沃东天骏信息技术有限公司 Image processing method, apparatus and computer readable storage medium
CN110837811B (en) * 2019-11-12 2021-01-05 腾讯科技(深圳)有限公司 Method, device and equipment for generating semantic segmentation network structure and storage medium
CN111062252B (en) * 2019-11-15 2023-11-10 浙江大华技术股份有限公司 Real-time dangerous goods semantic segmentation method, device and storage device
CN110942453A (en) * 2019-11-21 2020-03-31 山东众阳健康科技集团有限公司 CT image lung lobe identification method based on neural network
CN111223488B (en) * 2019-12-30 2023-01-17 Oppo广东移动通信有限公司 Voice wake-up method, device, equipment and storage medium
CN111310594B (en) * 2020-01-20 2023-04-28 浙江大学 Video semantic segmentation method based on residual error correction
WO2021179205A1 (en) * 2020-03-11 2021-09-16 深圳先进技术研究院 Medical image segmentation method, medical image segmentation apparatus and terminal device
CN111429464B (en) * 2020-03-11 2023-04-25 深圳先进技术研究院 Medical image segmentation method, medical image segmentation device and terminal equipment
CN112216293A (en) * 2020-08-28 2021-01-12 北京捷通华声科技股份有限公司 Tone conversion method and device
CN112394356B (en) * 2020-09-30 2024-04-02 桂林电子科技大学 Small target unmanned aerial vehicle detection system and method based on U-Net
CN112967293A (en) * 2021-03-04 2021-06-15 首都师范大学 Image semantic segmentation method and device and storage medium
CN113255675B (en) * 2021-04-13 2023-10-10 西安邮电大学 Image semantic segmentation network structure and method based on expanded convolution and residual path
CN113506307B (en) * 2021-06-29 2022-05-27 吉林大学 Medical image segmentation method for improving U-Net neural network based on residual connection
CN114565300B (en) * 2022-03-04 2022-12-23 中国科学院生态环境研究中心 Method and system for quantifying subjective emotion of public and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107016677A (en) * 2017-03-24 2017-08-04 北京工业大学 A kind of cloud atlas dividing method based on FCN and CNN
CN107316307A (en) * 2017-06-27 2017-11-03 北京工业大学 A kind of Chinese medicine tongue image automatic segmentation method based on depth convolutional neural networks
CN107808389A (en) * 2017-10-24 2018-03-16 上海交通大学 Unsupervised methods of video segmentation based on deep learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10726560B2 (en) * 2014-10-31 2020-07-28 Fyusion, Inc. Real-time mobile device capture and generation of art-styled AR/VR content

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107016677A (en) * 2017-03-24 2017-08-04 北京工业大学 A kind of cloud atlas dividing method based on FCN and CNN
CN107316307A (en) * 2017-06-27 2017-11-03 北京工业大学 A kind of Chinese medicine tongue image automatic segmentation method based on depth convolutional neural networks
CN107808389A (en) * 2017-10-24 2018-03-16 上海交通大学 Unsupervised methods of video segmentation based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
U-Net: Convolutional networks for biomedical image segmentation; O. Ronneberger et al.; Computer Science Department and BIOSS Centre for Biological Signalling Studies; 2015-12-31; pp. 1-8 *

Also Published As

Publication number Publication date
CN108876792A (en) 2018-11-23

Similar Documents

Publication Publication Date Title
CN108876792B (en) Semantic segmentation method, device and system and storage medium
US10733431B2 (en) Systems and methods for optimizing pose estimation
US10796452B2 (en) Optimizations for structure mapping and up-sampling
US10586350B2 (en) Optimizations for dynamic object instance detection, segmentation, and structure mapping
US10943145B2 (en) Image processing methods and apparatus, and electronic devices
CN109389027B (en) List structure extraction network
CN111670457B (en) Optimization of dynamic object instance detection, segmentation and structure mapping
CN111104962A (en) Semantic segmentation method and device for image, electronic equipment and readable storage medium
EP3493106B1 (en) Optimizations for dynamic object instance detection, segmentation, and structure mapping
AU2019200270A1 (en) Concept mask: large-scale segmentation from semantic concepts
CN105917354A (en) Spatial pyramid pooling networks for image processing
CN111860398B (en) Remote sensing image target detection method and system and terminal equipment
EP3493104A1 (en) Optimizations for dynamic object instance detection, segmentation, and structure mapping
CN112488923A (en) Image super-resolution reconstruction method and device, storage medium and electronic equipment
AU2021354030A1 (en) Processing images using self-attention based neural networks
CN111652054A (en) Joint point detection method, posture recognition method and device
CN110599455A (en) Display screen defect detection network model, method and device, electronic equipment and storage medium
CN115984701A (en) Multi-modal remote sensing image semantic segmentation method based on coding and decoding structure
CN111104941B (en) Image direction correction method and device and electronic equipment
CN115294337B (en) Method for training semantic segmentation model, image semantic segmentation method and related device
CN113610856A (en) Method and device for training image segmentation model and image segmentation
CN113808151A (en) Method, device and equipment for detecting weak semantic contour of live image and storage medium
CN117612178A (en) Document identification method and device
CN115131284A (en) Picture change detection method and related device
CN117575906A (en) Image super-resolution reconstruction method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant