CN115690787A - Semantic segmentation method, image processing apparatus, and computer-readable storage medium


Info

Publication number
CN115690787A
CN115690787A
Authority
CN
China
Prior art keywords
feature map
fusion
semantic segmentation
module
scale
Prior art date
2022-11-07
Legal status
Pending
Application number
CN202211390885.1A
Other languages
Chinese (zh)
Inventor
陈勇全
于佳辉
马天宇
黄锐
郝祁
马睿
曾祥容
Current Assignee
Chinese University of Hong Kong Shenzhen
Shenzhen Institute of Artificial Intelligence and Robotics
Original Assignee
Chinese University of Hong Kong Shenzhen
Shenzhen Institute of Artificial Intelligence and Robotics
Priority date
2022-11-07
Filing date
2022-11-07
Publication date
2023-02-03
Application filed by Chinese University of Hong Kong Shenzhen, Shenzhen Institute of Artificial Intelligence and Robotics filed Critical Chinese University of Hong Kong Shenzhen
Priority to CN202211390885.1A
Publication of CN115690787A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses a semantic segmentation method, an image processing device and a computer-readable storage medium, which are used for performing semantic segmentation on an original image while improving segmentation accuracy. The method of the embodiment of the application comprises the following steps: after an original image is obtained, the original image is input into a multi-scale feature extraction module of a semantic segmentation model to obtain first feature maps of N-level scales of the original image output by the multi-scale feature extraction module, where N is an integer greater than or equal to 2 and the semantic segmentation model is a pre-trained model; the first feature maps of the N-level scales are respectively input into a target fusion module of the semantic segmentation model, and the target fusion module performs at least one fusion process on the first feature maps of the N-level scales to obtain target fusion feature maps of the N-level scales output by the target fusion module; and a semantic segmentation image corresponding to the original image is generated according to the target fusion feature maps of the N-level scales.

Description

Semantic segmentation method, image processing apparatus, and computer-readable storage medium
Technical Field
The embodiment of the application relates to the field of image processing, and in particular to a semantic segmentation method, an image processing device and a computer-readable storage medium.
Background
The automobile industry is currently undergoing a transformation. Automatic driving allows a vehicle to drive itself without driver operation, and the related technologies are being developed and applied intensively along several paths, such as single-vehicle intelligence, vehicle-road cooperation, and networked cloud control. An automatic driving system senses the state of the vehicle and its surroundings in real time through sensors, makes planning decisions through an intelligent system, and finally executes driving operations through a control system. Although many researchers work in the related fields, segmentation for automatic driving still has many shortcomings, such as slow recognition of moving objects, low accuracy, and inaccurate prediction of static objects such as lane lines, which may cause serious traffic accidents and endanger personal and property safety. Therefore, the comprehensiveness and accuracy of image feature extraction in the semantic segmentation of automatic driving images play a key role for the whole automobile industry.
An existing semantic segmentation method is that of the Pyramid Occupancy Network (PON), which uses a residual module as the backbone feature extraction network and includes, within its deep convolutional neural network architecture, a Transformer module operating at a plurality of image scales.
However, the comprehensiveness and accuracy with which the Transformer module extracts image features are poor, resulting in low segmentation accuracy, i.e., a low intersection-over-union (IOU) between the semantic segmentation image output by the model and the labelled semantic segmentation image.
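For reference, the IOU compares the model's output with the labelled segmentation image class by class. A minimal sketch follows; representing both images as arrays of per-pixel class indices is an assumption for illustration only:

```python
import numpy as np

def class_iou(pred: np.ndarray, label: np.ndarray, cls: int) -> float:
    """Intersection-over-union of one semantic class between a predicted
    segmentation image and its labelled counterpart."""
    p = pred == cls
    g = label == cls
    union = np.logical_or(p, g).sum()
    return float(np.logical_and(p, g).sum() / union) if union else float("nan")
```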
Disclosure of Invention
The embodiment of the application provides a semantic segmentation method, an image processing device and a computer-readable storage medium, which are used for performing semantic segmentation on an original image while improving the intersection-over-union between the semantic segmentation image output by the model and the labelled semantic segmentation image.
In a first aspect, an embodiment of the present application provides a semantic segmentation method, including:
after an original image is obtained, inputting the original image into a multi-scale feature extraction module of a semantic segmentation model to obtain a first feature map of N-level scale of the original image output by the multi-scale feature extraction module; the N is an integer greater than or equal to 2; the semantic segmentation model is a pre-trained model;
respectively inputting the first feature maps of the N-level scales into a target fusion module of the semantic segmentation model, and obtaining the target fusion feature maps of the N-level scales output by the target fusion module after the target fusion module performs at least one fusion process on the first feature maps of the N-level scales;
and generating a semantic segmentation image corresponding to the original image according to the target fusion feature map with the N-level scale.
Optionally, before the original image is input into the multi-scale feature extraction module of the semantic segmentation model, the method further includes:
preprocessing the original image to obtain a preprocessed original image;
the multi-scale feature extraction module for inputting the original image into the semantic segmentation model comprises:
and inputting the preprocessed original image into the multi-scale feature extraction module.
Optionally, the target fusion module includes a first fusion module and a second fusion module;
after the target fusion module performs at least one fusion process on the first feature map with the N-level scale, the target fusion feature map with the N-level scale output by the target fusion module is obtained, which includes:
performing first fusion processing on the first feature map of the N-level scale by the first fusion module to obtain a first fusion feature map of the N-level scale output by the first fusion module;
performing second fusion processing on the first fusion feature map of the N-level scale by the second fusion module to obtain a second fusion feature map of the N-level scale output by the second fusion module; and the second fusion feature map of the N-level scale is the target fusion feature map of the N-level scale.
Optionally, the performing, by the first fusion module, the first fusion processing on the first feature map of the N-level scale to obtain the first fusion feature map of the N-level scale output by the first fusion module includes:
for the first fusion feature map of the (n-1)th-level scale, inputting the first feature map of the nth-level scale into an up-sampling module of the semantic segmentation model, and performing up-sampling processing by the up-sampling module to obtain an up-sampling feature map of the nth-level scale; wherein n is greater than or equal to 2 and less than or equal to N;
performing, by the first fusion module, first fusion processing on the up-sampling feature map of the nth-level scale and the first feature map of the (n-1)th-level scale to obtain a first fusion feature map of the (n-1)th-level scale output by the first fusion module;
and regarding the first fusion feature map of the Nth-level scale, taking the first feature map of the Nth-level scale as the first fusion feature map of the Nth-level scale.
Optionally, the performing, by the second fusion module, the second fusion processing on the first fusion feature map of the N-level scale to obtain a second fusion feature map of the N-level scale output by the second fusion module includes:
regarding the second fused feature map of the 1 st-level scale, taking the first fused feature map of the 1 st-level scale as the second fused feature map of the 1 st-level scale;
for the second fusion feature map of the nth-level scale, inputting the first feature map of the (n-1)th-level scale into a downsampling module of the semantic segmentation model, and carrying out downsampling processing by the downsampling module to obtain a downsampling feature map of the (n-1)th-level scale; wherein n is greater than or equal to 2 and less than or equal to N;
performing, by the second fusion module, second fusion processing on the down-sampled feature map of the (n-1)th-level scale and the first fusion feature map of the nth-level scale to obtain a second fusion feature map of the nth-level scale output by the second fusion module.
optionally, the generating a semantic segmentation image corresponding to the original image according to the target fusion feature map of the N-level scale includes:
inputting the target fusion feature map with the N-level scale into a global convolution module of the semantic segmentation model to obtain a second feature map with the N-level scale output by the global convolution module;
and generating the semantic segmentation image corresponding to the original image according to the second feature map of the N-level scale.
Optionally, the generating the semantic segmentation image corresponding to the original image according to the second feature map of the N-level scale includes:
inputting the second feature map of the N-level scale into an edge refining module of the semantic segmentation model to obtain a third feature map of the N-level scale output by the edge refining module;
and generating the semantic segmentation image corresponding to the original image according to the third feature map.
Optionally, before the original image is input into the multi-scale feature extraction module of the semantic segmentation model, the method further includes:
obtaining an original image sample; wherein, each original image sample is marked with a semantic segmentation image;
inputting the original image sample into a semantic segmentation model to obtain a predicted semantic segmentation image corresponding to the original image sample output by the semantic segmentation model;
and calculating the loss between the predicted semantic segmentation image and the annotated semantic segmentation image according to a regression loss function, and obtaining a trained semantic segmentation model when the loss meets a convergence condition.
In a second aspect, an embodiment of the present application provides an image processing apparatus, including:
the system comprises a central processing unit, a memory, an input/output interface, a wired or wireless network interface and a power supply;
the memory is a transient storage memory or a persistent storage memory;
the central processor is configured to communicate with the memory and execute the instruction operations in the memory to perform the aforementioned semantic segmentation method.
In a third aspect, embodiments of the present application provide a computer-readable storage medium, which includes instructions that, when executed on a computer, cause the computer to perform the aforementioned semantic segmentation method.
In a fourth aspect, the present application provides a computer program product containing instructions for causing a computer to perform the aforementioned semantic segmentation method when the computer program product runs on the computer.
According to the above technical solutions, the embodiment of the application has the following advantages: after the first feature maps of the N-level scales of the original image are obtained, they are respectively input into the target fusion module of the semantic segmentation model, and the target fusion module performs at least one fusion process on them, so that the comprehensiveness and accuracy of image feature extraction are improved, and the intersection-over-union between the semantic segmentation image output by the model and the labelled semantic segmentation image is improved.
Drawings
Fig. 1 is a schematic flow chart of a semantic segmentation method disclosed in an embodiment of the present application;
Fig. 2 is a schematic flow chart of a method for obtaining a semantic segmentation image by inputting an original image into a semantic segmentation model for semantic segmentation, disclosed in an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a global convolution module disclosed in an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an edge refinement module disclosed in an embodiment of the present application;
Fig. 5 is a schematic structural diagram of an image processing apparatus disclosed in an embodiment of the present application.
Detailed Description
The embodiment of the application provides a semantic segmentation method, an image processing device and a computer-readable storage medium, which are used for performing semantic segmentation on an original image while improving the intersection-over-union between the semantic segmentation image output by the model and the labelled semantic segmentation image.
Referring to fig. 1, fig. 1 is a schematic flow chart of a semantic segmentation method disclosed in an embodiment of the present application, where the method includes:
101. after an original image is obtained, inputting the original image into a multi-scale feature extraction module of a semantic segmentation model to obtain a first feature map of N-level scales of the original image output by the multi-scale feature extraction module; n is an integer greater than or equal to 2; the semantic segmentation model is a pre-trained model.
In the embodiment, when performing semantic segmentation, an original image can be obtained, and after the original image is obtained, the original image can be input into a multi-scale feature extraction module of a semantic segmentation model to obtain a first feature map of an N-level scale of the original image output by the multi-scale feature extraction module; n is an integer greater than or equal to 2; the semantic segmentation model is a pre-trained model.
102. And respectively inputting the first feature maps with the N-level scales into a target fusion module of the semantic segmentation model, and performing fusion processing on the first feature maps with the N-level scales at least once by using the target fusion module to obtain the target fusion feature maps with the N-level scales output by the target fusion module.
After the first feature maps of the N-level scales of the original image output by the multi-scale feature extraction module are obtained, the first feature maps of the N-level scales may be input into the target fusion module of the semantic segmentation model, and the target fusion module performs at least one fusion process on them to obtain the target fusion feature maps of the N-level scales output by the target fusion module. It can be understood that the at least one fusion process may proceed as follows: the first feature maps of the N-level scales are first subjected to a first fusion process to obtain first fusion feature maps of the N-level scales, and the first fusion feature maps of the N-level scales are then subjected to a second fusion process to obtain second fusion feature maps of the N-level scales; the second fusion feature maps of the N-level scales are the target fusion feature maps of the N-level scales. Other reasonable ways of performing at least one fusion process are also possible, which is not limited herein.
103. And generating a semantic segmentation image corresponding to the original image according to the target fusion feature map with the N-level scale.
After the target fusion feature map of N-level scale output by the target fusion module is obtained, the semantic segmentation image corresponding to the original image can be generated according to the target fusion feature map of N-level scale.
In the embodiment of the application, after the first feature maps of the N-level scales of the original image are obtained, they are respectively input into the target fusion module of the semantic segmentation model, and the target fusion module performs at least one fusion process on them, so that the comprehensiveness and accuracy of image feature extraction are improved, and the intersection-over-union between the semantic segmentation image output by the model and the labelled semantic segmentation image is improved.
In this embodiment of the application, after the target fusion module performs at least one fusion process on the first feature map with the N-level scale, there may be multiple methods for obtaining the target fusion feature map with the N-level scale output by the target fusion module, and one of the methods is described below based on the semantic segmentation method shown in fig. 1.
In this embodiment, an original image can be obtained when performing semantic segmentation. Specifically, semantic segmentation means that the many objects in the original image, such as drivable areas, lane lines, buses, cars, pedestrians and obstacles, are segmented semantically into a semantic segmentation map (a bird's-eye view, BEV). The original image may be obtained from a public large-scale automatic driving dataset (the nuScenes dataset), which was collected by acquisition vehicles in 1000 scenes across different cities; each acquisition vehicle is equipped with a full sensor suite comprising 6 cameras (CAM), 1 LiDAR, 5 millimeter-wave radars (RADAR), an IMU and a GPS. Other reasonable methods for obtaining the original image are also possible, which is not limited herein.
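As a hypothetical illustration of pulling original images from nuScenes with the official devkit, a minimal sketch follows; the dataroot, the v1.0-mini split and the choice of the front camera are assumptions not taken from the patent:

```python
from nuscenes.nuscenes import NuScenes  # pip install nuscenes-devkit

# Load a nuScenes split; version and dataroot are illustrative assumptions.
nusc = NuScenes(version="v1.0-mini", dataroot="/data/nuscenes", verbose=False)

for sample in nusc.sample:
    cam_token = sample["data"]["CAM_FRONT"]            # front-camera frame of this sample
    image_path = nusc.get_sample_data_path(cam_token)  # original image fed to the model
```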
After the original image is obtained, the original image may be preprocessed to obtain a preprocessed original image. Specifically, the preprocessing may include normalization, scaling and the like, where normalization converts the pixel values of the original image to the range 0-1, and scaling resizes the height and width of the input image using a resize() function. It should be understood that preprocessing the original image can reduce the dependence of the network model on particular features and improve the generalization and robustness of the model.
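A minimal sketch of this preprocessing, assuming OpenCV for scaling; the interpolation mode and the 196 × 200 target size (taken from the 3 × 196 × 200 example below) are assumptions:

```python
import cv2
import numpy as np

def preprocess(image: np.ndarray, height: int = 196, width: int = 200) -> np.ndarray:
    """Scale the input image and normalize pixel values to the range 0-1."""
    resized = cv2.resize(image, (width, height), interpolation=cv2.INTER_LINEAR)
    normalized = resized.astype(np.float32) / 255.0  # pixel values converted to 0-1
    return normalized.transpose(2, 0, 1)             # HWC -> CHW, giving 3 x 196 x 200
```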
After the preprocessed original image is obtained, it can be input into the multi-scale feature extraction module of the semantic segmentation model to obtain the first feature maps of the N-level scales of the original image output by the multi-scale feature extraction module; N is an integer greater than or equal to 2, and the semantic segmentation model is a pre-trained model. It can be understood that the pre-trained semantic segmentation model may be an improved FPN-style pyramid occupancy network semantic segmentation model, and the multi-scale feature extraction module may be a residual module (ResNet50 module). For example, the preprocessed original image may have a size of 3 × 196 × 200 and is input into the ResNet50 module to obtain the first feature maps of the N-level scales of the original image output by the ResNet50 module. Referring to fig. 2, a schematic flow chart of inputting the original image into the semantic segmentation model to obtain the semantic segmentation image, N is 5 in fig. 2: the first feature map of the 1st-level scale may be a 1/8-scale first feature map, the first feature map of the 2nd-level scale may be a 1/16-scale first feature map, the first feature map of the 3rd-level scale may be a 1/32-scale first feature map, the first feature map of the 4th-level scale may be a 1/64-scale first feature map, and the first feature map of the 5th-level scale may be a 1/128-scale first feature map; the higher the scale level, the smaller the size of the first feature map.
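A sketch of such a five-level extractor built on torchvision's ResNet-50 is shown below. ResNet-50 natively provides the 1/8, 1/16 and 1/32 features; the two extra strided convolutions producing the 1/64 and 1/128 levels are assumptions, since the text fixes only the backbone and the number of levels:

```python
import torch.nn as nn
from torchvision.models import resnet50

class MultiScaleExtractor(nn.Module):
    """Outputs first feature maps at 1/8, 1/16, 1/32, 1/64 and 1/128 scale."""

    def __init__(self):
        super().__init__()
        r = resnet50(weights=None)
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool, r.layer1)  # -> 1/4
        self.layer2, self.layer3, self.layer4 = r.layer2, r.layer3, r.layer4
        self.down5 = nn.Conv2d(2048, 2048, 3, stride=2, padding=1)  # 1/32 -> 1/64
        self.down6 = nn.Conv2d(2048, 2048, 3, stride=2, padding=1)  # 1/64 -> 1/128

    def forward(self, x):
        c1 = self.layer2(self.stem(x))   # level 1: 1/8, 512 channels
        c2 = self.layer3(c1)             # level 2: 1/16, 1024 channels
        c3 = self.layer4(c2)             # level 3: 1/32, 2048 channels
        c4 = self.down5(c3)              # level 4: 1/64
        c5 = self.down6(c4)              # level 5: 1/128
        return [c1, c2, c3, c4, c5]      # first feature maps, levels 1..5
```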
After the first feature maps of the N-level scales of the original image output by the multi-scale feature extraction module are obtained, the first feature maps of the N-level scales can be respectively input into a target fusion module of the semantic segmentation model, and the target fusion module performs at least one fusion processing on the first feature maps of the N-level scales to obtain target fusion feature maps of the N-level scales output by the target fusion module.
The target fusion module may perform at least one fusion process on the first feature maps of the N-level scales as follows: the first fusion module performs a first fusion process on the first feature maps of the N-level scales to obtain the first fusion feature maps of the N-level scales output by the first fusion module, and the second fusion module then performs a second fusion process on the first fusion feature maps of the N-level scales to obtain the second fusion feature maps of the N-level scales output by the second fusion module. The target fusion module comprises the first fusion module and the second fusion module, and the second fusion feature maps of the N-level scales are the target fusion feature maps of the N-level scales.
The first fusion module performs the first fusion process on the first feature maps of the N-level scales as follows. For the first fusion feature map of the (n-1)th-level scale, the first feature map of the nth-level scale is input into an upsampling module of the semantic segmentation model, and the upsampling module performs upsampling to obtain an upsampled feature map of the nth-level scale, wherein n is greater than or equal to 2 and less than or equal to N; the first fusion module then performs the first fusion process on the upsampled feature map of the nth-level scale and the first feature map of the (n-1)th-level scale to obtain the first fusion feature map of the (n-1)th-level scale output by the first fusion module. For the first fusion feature map of the Nth-level scale, the first feature map of the Nth-level scale is taken as the first fusion feature map of the Nth-level scale.
With reference to fig. 2, it can be seen that, for the first fusion feature map of the 1st-level scale, the first feature map of the 2nd-level scale (the 1/16-scale first feature map) may be input into the upsampling module of the semantic segmentation model, and the upsampling module performs upsampling to obtain the upsampled feature map of the 2nd-level scale (the 1/8-scale upsampled feature map obtained by upsampling the 1/16-scale first feature map); the first fusion module then performs the first fusion process on the upsampled feature map of the 2nd-level scale and the first feature map of the 1st-level scale to obtain the first fusion feature map of the 1st-level scale output by the first fusion module. It can be understood that the first fusion feature maps of the 2nd-, 3rd- and 4th-level scales are obtained in a similar way and are not described here again. For the first fusion feature map of the 5th-level scale, the first feature map of the 5th-level scale may be used directly.
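A sketch of this top-down first fusion is given below. The 1 × 1 lateral convolutions that align channel counts, the nearest-neighbour upsampling, and addition as the fusion operation are assumptions; the text fixes only that the nth-level first feature map is upsampled and fused with the (n-1)th-level first feature map:

```python
import torch.nn as nn
import torch.nn.functional as F

class FirstFusion(nn.Module):
    """Top-down first fusion over the first feature maps of N scale levels."""

    def __init__(self, in_channels=(512, 1024, 2048, 2048, 2048), mid=256):
        super().__init__()
        # 1x1 laterals (an assumption) align all levels to `mid` channels.
        self.lateral = nn.ModuleList(nn.Conv2d(c, mid, 1) for c in in_channels)

    def forward(self, feats):                    # feats: levels 1..N, fine -> coarse
        proj = [l(f) for l, f in zip(self.lateral, feats)]
        fused = [None] * len(proj)
        fused[-1] = proj[-1]                     # level N: its first map is reused as-is
        for n in range(len(proj) - 2, -1, -1):
            up = F.interpolate(proj[n + 1], size=proj[n].shape[-2:], mode="nearest")
            fused[n] = proj[n] + up              # fuse upsampled level n into level n-1
        return proj, fused                       # channel-aligned maps + first-fusion maps
```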
The second fusion module performs the second fusion process on the first fusion feature maps of the N-level scales as follows. For the second fusion feature map of the 1st-level scale, the first fusion feature map of the 1st-level scale is taken as the second fusion feature map of the 1st-level scale. For the second fusion feature map of the nth-level scale, the first feature map of the (n-1)th-level scale is input into a downsampling module of the semantic segmentation model, and the downsampling module performs downsampling to obtain a downsampled feature map of the (n-1)th-level scale, wherein n is greater than or equal to 2 and less than or equal to N; the second fusion module then performs the second fusion process on the downsampled feature map of the (n-1)th-level scale and the first fusion feature map of the nth-level scale to obtain the second fusion feature map of the nth-level scale output by the second fusion module.
With continuing reference to fig. 2, for the second fusion feature map of the 1st-level scale, the first fusion feature map of the 1st-level scale can be used directly; for the second fusion feature map of the 2nd-level scale, the first feature map of the 1st-level scale is input into the downsampling module of the semantic segmentation model, and the downsampling module performs downsampling to obtain the downsampled feature map of the 1st-level scale; the second fusion module then performs the second fusion process on the downsampled feature map of the 1st-level scale and the first fusion feature map of the 2nd-level scale to obtain the second fusion feature map of the 2nd-level scale output by the second fusion module. The second fusion feature maps of the 3rd-, 4th- and 5th-level scales are obtained in a similar way and are not repeated here.
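A matching sketch of the bottom-up second fusion follows; a stride-2 3 × 3 convolution as the downsampling module and addition as the fusion operation are assumptions, and the first feature maps are assumed to have been channel-aligned already (the `proj` output of the FirstFusion sketch above):

```python
import torch.nn as nn

class SecondFusion(nn.Module):
    """Bottom-up second fusion producing the target fusion feature maps."""

    def __init__(self, channels=256, levels=5):
        super().__init__()
        # One stride-2 downsampling conv per level transition (an assumption).
        self.down = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1)
            for _ in range(levels - 1))

    def forward(self, first_feats, first_fused):
        second = [first_fused[0]]                     # level 1: first-fusion map reused as-is
        for n in range(1, len(first_fused)):          # patent level n = index n + 1
            d = self.down[n - 1](first_feats[n - 1])  # downsample level n-1 first map
            second.append(first_fused[n] + d)         # fuse into level n first-fusion map
        return second                                 # second (target) fusion feature maps
```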
It is worth mentioning that performing the fusion process twice improves the comprehensiveness and accuracy of image feature extraction and therefore the segmentation accuracy, i.e., the intersection-over-union between the semantic segmentation image output by the model and the labelled semantic segmentation image.
After the target fusion feature map of N-level scale output by the target fusion module is obtained, the semantic segmentation image corresponding to the original image can be generated according to the target fusion feature map of N-level scale.
The semantic segmentation image corresponding to the original image may be generated according to the target fusion feature maps of the N-level scales by inputting the target fusion feature maps of the N-level scales into a global convolution module of the semantic segmentation model to obtain second feature maps of the N-level scales output by the global convolution module. Referring to fig. 3, a schematic structural diagram of the global convolution module, 14 in fig. 3 is the number of semantic categories. The global convolution module has two branches that together replace a k × k convolution with 1 × k and k × 1 convolutions, and no nonlinear activation function follows the module. This reduces the amount of computation while preserving a sufficient receptive field, improving detection speed without sacrificing accuracy and thereby improving timeliness. In addition, the global convolution module replaces fully connected layers and global pooling, reducing the possibility of losing localization information and improving localization performance.
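A sketch of such a module follows. Two symmetric branches and k = 7 are assumptions; the text fixes only the 1 × k / k × 1 factorization, the absence of a following nonlinearity, and the 14 output channels (one per semantic category):

```python
import torch.nn as nn

class GlobalConvModule(nn.Module):
    """Global convolution sketch: a large k x k kernel factorized into
    1 x k and k x 1 convolutions with no activation afterwards."""

    def __init__(self, in_ch, out_ch=14, k=7):
        super().__init__()
        p = k // 2
        self.branch_a = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, (1, k), padding=(0, p)),
            nn.Conv2d(out_ch, out_ch, (k, 1), padding=(p, 0)))
        self.branch_b = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, (k, 1), padding=(p, 0)),
            nn.Conv2d(out_ch, out_ch, (1, k), padding=(0, p)))

    def forward(self, x):
        return self.branch_a(x) + self.branch_b(x)  # no activation follows, as described
```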
After the second feature maps of the N-level scales output by the global convolution module are obtained, they may be input into an edge refinement module of the semantic segmentation model to obtain third feature maps of the N-level scales output by the edge refinement module, and the semantic segmentation image corresponding to the original image may then be generated according to the third feature maps. Referring to fig. 4, a schematic structural diagram of the edge refinement module, fig. 4 shows a residual connection in which the w × h × 14 input at the top is the coarse feature map; the residual connection on the side branch improves the boundaries of recognized objects to achieve an edge refinement effect, thereby increasing the IOU of the semantic segmentation image.
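A minimal sketch of the residual edge refinement described above; the 3 × 3 kernels and the single ReLU inside the side branch are assumptions beyond the patent text, which fixes only the residual connection around the coarse w × h × 14 map:

```python
import torch.nn as nn

class EdgeRefineModule(nn.Module):
    """Edge refinement sketch: a side residual branch sharpens boundaries."""

    def __init__(self, channels=14):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, coarse):
        return coarse + self.residual(coarse)  # residual connection refines boundaries
```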
It is worth mentioning that, before the original image is input into the multi-scale feature extraction module of the semantic segmentation model, the semantic segmentation model can be trained. The training may proceed as follows: first, original image samples are obtained, where each original image sample is annotated with a semantic segmentation image; the original image samples are then input into the semantic segmentation model to obtain the predicted semantic segmentation images output by the model; finally, the loss between the predicted semantic segmentation images and the annotated semantic segmentation images is calculated according to a regression loss function, and the trained semantic segmentation model is obtained when the loss meets a convergence condition. Specifically, training can be performed with an Adam optimizer at a learning rate of 0.001.
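A training sketch under these settings; Adam with learning rate 0.001 follows the text, while the use of smooth L1 as the regression loss, the epoch count and the omission of an explicit convergence check are assumptions:

```python
import torch

def train_model(model, loader, epochs=10, device="cuda"):
    """Optimize the segmentation model against annotated segmentation images."""
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    criterion = torch.nn.SmoothL1Loss()              # a regression loss function
    model.to(device).train()
    for _ in range(epochs):
        for images, labels in loader:                # labels: annotated segmentation maps
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)  # predicted vs. annotated images
            loss.backward()
            optimizer.step()
```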
It is worth mentioning that, compared with the prior-art Pyramid Occupancy Network (PON), the improved pyramid feature extraction network of this embodiment extracts semantics more fully and generates more accurate results; a single detection takes only 0.2 seconds on a computer using a GPU (NVIDIA RTX 3090). Compared with the monocular semantic occupancy grid mapping (VED) method based on a convolutional variational encoder-decoder network, i.e., a method that predicts the semantic segmentation image with an encoder-decoder network model taking RGB video as input, this embodiment improves the comprehensiveness of feature extraction and the accuracy of semantic segmentation, and improves the semantic segmentation of dynamic objects, so the method is suitable for actual driving scenarios. Compared with the cross-view semantic segmentation (VPN) method for sensing the surroundings, i.e., a method that models surrounding objects through a view relation module for semantic segmentation, this embodiment improves the completeness of spatial information and ensures the semantic segmentation accuracy of both small and large targets, thereby improving the overall semantic segmentation accuracy and the IOU.
It can be understood that, besides the methods described above for the target fusion module performing at least one fusion process on the first feature maps of the N-level scales, for the first fusion module performing the first fusion process, for the second fusion module performing the second fusion process, and for generating the semantic segmentation image corresponding to the original image according to the target fusion feature maps of the N-level scales, other reasonable methods are also possible, which is not limited herein.
In this embodiment, after the first feature maps of the N-level scales of the original image are obtained, they are input into the target fusion module of the semantic segmentation model, and the target fusion module performs at least one fusion process on them, so that the comprehensiveness and accuracy of image feature extraction are improved and the segmentation accuracy is improved, that is, the intersection-over-union (IOU) between the semantic segmentation image output by the model and the labelled semantic segmentation image is improved. Secondly, the target fusion feature maps of the N-level scales can be input into the global convolution module of the semantic segmentation model for further feature extraction to obtain the second feature maps of the N-level scales output by the global convolution module, which improves the comprehensiveness and accuracy of image feature extraction and remedies the incomplete use of feature layers in the model, thereby improving segmentation accuracy and the IOU. Moreover, the second feature maps of the N-level scales output by the global convolution module can be input into the edge refinement module to obtain the third feature maps of the N-level scales output by the edge refinement module, improving the degree of edge refinement and thus the segmentation accuracy. Finally, performing feature extraction through the multi-scale feature extraction module, the target fusion module, the global convolution module and the edge refinement module in sequence improves the integrity of the semantic segmentation model and increases semantic segmentation speed while extracting more semantic features.
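Putting the sketches above together, a hypothetical end-to-end assembly might look as follows; how the N per-level outputs are merged into one segmentation image is not fixed by the text, so upsampling them to the finest resolution and summing them here is an assumption:

```python
import torch.nn as nn
import torch.nn.functional as F

class SemanticSegmentationModel(nn.Module):
    """Chains the multi-scale extractor, two-stage fusion, global convolution
    and edge refinement sketches defined above (all assumptions as noted)."""

    def __init__(self, levels=5, channels=256, classes=14):
        super().__init__()
        self.backbone = MultiScaleExtractor()
        self.fuse1 = FirstFusion(mid=channels)
        self.fuse2 = SecondFusion(channels=channels, levels=levels)
        self.gcn = nn.ModuleList(GlobalConvModule(channels, classes) for _ in range(levels))
        self.refine = nn.ModuleList(EdgeRefineModule(classes) for _ in range(levels))

    def forward(self, x):
        first = self.backbone(x)            # first feature maps of the N-level scales
        proj, fused1 = self.fuse1(first)    # first fusion feature maps
        fused2 = self.fuse2(proj, fused1)   # target fusion feature maps
        outs = [r(g(f)) for r, g, f in zip(self.refine, self.gcn, fused2)]
        size = outs[0].shape[-2:]           # finest (level 1) resolution
        return sum(F.interpolate(o, size=size, mode="bilinear", align_corners=False)
                   for o in outs)           # per-pixel class scores
```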
Referring to fig. 5, an embodiment of the image processing apparatus 500 in the embodiment of the present application includes:
a central processing unit 501, a memory 505, an input/output interface 504, a wired or wireless network interface 503 and a power supply 502;
memory 505 is a transient storage memory or a persistent storage memory;
the central processor 501 is configured to communicate with the memory 505 and execute the instructions in the memory 505 to perform the method described in the foregoing embodiment shown in fig. 1.
The embodiment of the present application further provides a computer-readable storage medium including instructions which, when executed on a computer, cause the computer to perform the method in the foregoing embodiment shown in fig. 1.
The embodiment of the present application also provides a computer program product containing instructions, which when run on a computer, causes the computer to execute the method in the foregoing embodiment shown in fig. 1.
It should be understood that, although the steps in the flowcharts related to the embodiments described above are shown in sequence as indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, the steps are not strictly limited to the order illustrated and may be performed in other orders. Moreover, at least some of the steps in these flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and their execution order is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or in whole or in part, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.

Claims (10)

1. A method of semantic segmentation, comprising:
after an original image is obtained, inputting the original image into a multi-scale feature extraction module of a semantic segmentation model to obtain a first feature map of N-level scales of the original image output by the multi-scale feature extraction module; the N is an integer greater than or equal to 2; the semantic segmentation model is a pre-trained model;
respectively inputting the first feature maps of the N-level scales into a target fusion module of the semantic segmentation model, and obtaining the target fusion feature maps of the N-level scales output by the target fusion module after the target fusion module performs at least one fusion process on the first feature maps of the N-level scales;
and generating a semantic segmentation image corresponding to the original image according to the target fusion feature map with the N-level scale.
2. The method of claim 1, wherein prior to inputting the original image into a multi-scale feature extraction module of a semantic segmentation model, the method further comprises:
preprocessing the original image to obtain a preprocessed original image;
the multi-scale feature extraction module for inputting the original image into the semantic segmentation model comprises:
and inputting the preprocessed original image into the multi-scale feature extraction module.
3. The method of claim 1, wherein the target fusion module comprises a first fusion module and a second fusion module;
after the target fusion module performs at least one fusion process on the first feature map with the N-level scale, the target fusion feature map with the N-level scale output by the target fusion module is obtained, which includes:
performing first fusion processing on the first feature map of the N-level scale by the first fusion module to obtain a first fusion feature map of the N-level scale output by the first fusion module;
performing second fusion processing on the first fusion feature map of the N-level scale by the second fusion module to obtain a second fusion feature map of the N-level scale output by the second fusion module; and the second fusion feature map of the N-level scale is the target fusion feature map of the N-level scale.
4. The method according to claim 3, wherein the performing, by the first fusion module, the first fusion process on the first feature map of the N-level scale to obtain the first fused feature map of the N-level scale output by the first fusion module includes:
for the first fusion feature map of the (n-1)th-level scale, inputting the first feature map of the nth-level scale into an up-sampling module of the semantic segmentation model, and performing up-sampling processing by the up-sampling module to obtain an up-sampling feature map of the nth-level scale; wherein n is greater than or equal to 2 and less than or equal to N;
performing, by the first fusion module, first fusion processing on the up-sampling feature map of the nth-level scale and the first feature map of the (n-1)th-level scale to obtain a first fusion feature map of the (n-1)th-level scale output by the first fusion module;
and regarding the first fusion feature map of the Nth-level scale, taking the first feature map of the Nth-level scale as the first fusion feature map of the Nth-level scale.
5. The method according to claim 4, wherein the performing, by the second fusion module, the second fusion process on the first fusion feature map of the N-level scale to obtain the second fusion feature map of the N-level scale output by the second fusion module includes:
regarding the second fusion feature map of the 1 st-level scale, taking the first fusion feature map of the 1 st-level scale as the second fusion feature map of the 1 st-level scale;
for the second fusion feature map of the nth-level scale, inputting the first feature map of the (n-1)th-level scale into a down-sampling module of the semantic segmentation model, and performing down-sampling processing by the down-sampling module to obtain a down-sampling feature map of the (n-1)th-level scale; wherein n is greater than or equal to 2 and less than or equal to N;
and performing, by the second fusion module, second fusion processing on the down-sampling feature map of the (n-1)th-level scale and the first fusion feature map of the nth-level scale to obtain a second fusion feature map of the nth-level scale output by the second fusion module.
6. The method according to claim 1, wherein the generating of the semantic segmentation image corresponding to the original image according to the target fusion feature map of the N-level scale comprises:
inputting the target fusion feature map of the N-level scale into a global convolution module of the semantic segmentation model to obtain a second feature map of the N-level scale output by the global convolution module;
and generating the semantic segmentation image corresponding to the original image according to the second feature map with the N-level scale.
7. The method according to claim 6, wherein the generating the semantic segmentation image corresponding to the original image according to the second feature map of the N-level scale comprises:
inputting the second feature map of the N-level scale into an edge refining module of the semantic segmentation model to obtain a third feature map of the N-level scale output by the edge refining module;
and generating the semantic segmentation image corresponding to the original image according to the third feature map.
8. The method of claim 1, wherein prior to inputting the original image into a multi-scale feature extraction module of a semantic segmentation model, the method further comprises:
obtaining an original image sample; wherein, each original image sample is marked with a semantic segmentation image;
inputting the original image sample into a semantic segmentation model to obtain a predicted semantic segmentation image corresponding to the original image sample output by the semantic segmentation model;
and calculating the loss between the predicted semantic segmentation image and the annotated semantic segmentation image according to a regression loss function, and obtaining a trained semantic segmentation model when the loss meets a convergence condition.
9. An image processing apparatus characterized by comprising:
the system comprises a central processing unit, a memory, an input/output interface, a wired or wireless network interface and a power supply;
the memory is a transient storage memory or a persistent storage memory;
the central processor is configured to communicate with the memory and execute the instructions in the memory to perform the method of any of claims 1 to 8.
10. A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method of any one of claims 1 to 8.
CN202211390885.1A 2022-11-07 2022-11-07 Semantic segmentation method, image processing apparatus, and computer-readable storage medium Pending CN115690787A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211390885.1A CN115690787A (en) 2022-11-07 2022-11-07 Semantic segmentation method, image processing apparatus, and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211390885.1A CN115690787A (en) 2022-11-07 2022-11-07 Semantic segmentation method, image processing apparatus, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
CN115690787A true CN115690787A (en) 2023-02-03

Family

ID=85049602

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211390885.1A Pending CN115690787A (en) 2022-11-07 2022-11-07 Semantic segmentation method, image processing apparatus, and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN115690787A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116363161A (en) * 2023-06-02 2023-06-30 清华大学 Multi-category segmentation method and device for cement hydration image
CN116363161B (en) * 2023-06-02 2023-08-01 清华大学 Multi-category segmentation method and device for cement hydration image

Similar Documents

Publication Publication Date Title
JP7289918B2 (en) Object recognition method and device
CN111598030B (en) Method and system for detecting and segmenting vehicle in aerial image
CN111666921B (en) Vehicle control method, apparatus, computer device, and computer-readable storage medium
CN112418236B (en) Automobile drivable area planning method based on multitask neural network
EP4152204A1 (en) Lane line detection method, and related apparatus
CN113343778B (en) Lane line detection method and system based on LaneSegNet
WO2021218786A1 (en) Data processing system, object detection method and apparatus thereof
CN112990065B (en) Vehicle classification detection method based on optimized YOLOv5 model
CN112861619A (en) Model training method, lane line detection method, equipment and device
CN114802261B (en) Parking control method, obstacle recognition model training method and device
CN115546768A (en) Pavement marking identification method and system based on multi-scale mechanism and attention mechanism
CN113095152A (en) Lane line detection method and system based on regression
CN111931683A (en) Image recognition method, image recognition device and computer-readable storage medium
CN114764856A (en) Image semantic segmentation method and image semantic segmentation device
CN115376089A (en) Deep learning-based lane line detection method
CN115690787A (en) Semantic segmentation method, image processing apparatus, and computer-readable storage medium
Cervera-Uribe et al. U19-Net: a deep learning approach for obstacle detection in self-driving cars
CN114973199A (en) Rail transit train obstacle detection method based on convolutional neural network
Gao et al. Robust lane line segmentation based on group feature enhancement
Yasmin et al. Small obstacles detection on roads scenes using semantic segmentation for the safe navigation of autonomous vehicles
CN114495050A (en) Multitask integrated detection method for automatic driving forward vision detection
CN117115690A (en) Unmanned aerial vehicle traffic target detection method and system based on deep learning and shallow feature enhancement
CN117372991A (en) Automatic driving method and system based on multi-view multi-mode fusion
CN113780189A (en) Lane line detection method based on U-Net improvement
Guo et al. The dynamic-static dual-branch deep neural network for urban speeding hotspot identification using street view image data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination