CN113159057B - Image semantic segmentation method and computer equipment


Info

Publication number: CN113159057B
Application number: CN202110353991.1A
Authority: CN (China)
Prior art keywords: feature map, training, pooling, semantic, classification
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN113159057A (en)
Inventors: 王改华, 翟乾宇, 甘鑫, 曹清程
Current Assignee: Hubei University of Technology
Original Assignee: Hubei University of Technology
Application filed by Hubei University of Technology; priority to CN202110353991.1A
Publication of CN113159057A; application granted; publication of CN113159057B

Classifications

    • G06V 10/28: Quantising the image, e.g. histogram thresholding for discrimination between background and foreground patterns
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N 3/045: Combinations of networks
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G06T 7/10: Segmentation; Edge detection
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06T 2207/10024: Color image
    • G06T 2207/20081: Training; Learning
    • G06T 2207/20084: Artificial neural networks [ANN]


Abstract

The invention provides an image semantic segmentation method and computer equipment. The image semantic segmentation method comprises the following steps: inputting an image to be processed into a lightweight neural network to obtain a lightweight feature map; inputting the lightweight feature map into an enhanced pyramid network to obtain a spliced feature map; inputting the spliced feature map into a classification network to obtain a plurality of classification feature maps; for each classification feature map, inputting the classification feature map into a bar-shaped attention network to obtain a corresponding attention feature map, and adding the classification feature map and the attention feature map to obtain a corresponding semantic feature map; and determining a semantic segmentation result according to the plurality of semantic feature maps. The invention reduces the amount of computation while guaranteeing the precision of image semantic segmentation, and is suitable for terminals with limited hardware resources.

Description

Image semantic segmentation method and computer equipment
Technical Field
The present application relates to the field of image processing, and in particular, to a method and a computer device for semantic segmentation of an image.
Background
Against the background of deep learning's development, convolutional neural networks have been accepted by more and more people and applied more and more widely, and image semantic segmentation is widely used in fields such as automatic driving, medical image diagnosis, and remote sensing image analysis. However, the general trend in deep learning is to achieve higher accuracy through deeper and more complex networks, and such networks are generally at a disadvantage in model size and running speed. In real life there are many terminals with limited hardware resources on which image semantic segmentation cannot be applied, which limits the development and application of image semantic segmentation.
Therefore, the prior art is in need of improvement.
Disclosure of Invention
The invention aims to solve the technical problem that existing image semantic segmentation methods have complex model structures, require large amounts of computation, and occupy substantial hardware resources, and therefore cannot be applied on terminals with limited hardware resources. An image semantic segmentation method and computer equipment are provided that reduce the number of parameters used for computation while guaranteeing the precision of image semantic segmentation, and are therefore suitable for terminals with limited hardware resources.
In a first aspect, an embodiment of the present invention provides an image semantic segmentation method, which is applied to a semantic segmentation model, where the semantic segmentation model includes a lightweight neural network, an enhanced pyramid network, a classification network, and a bar-shaped attention network; the image semantic segmentation method comprises the following steps:
inputting an image to be processed into the lightweight neural network to obtain a lightweight feature map;
inputting the lightweight feature map into the enhanced pyramid network to obtain a spliced feature map;
inputting the spliced feature map into the classification network to obtain a plurality of classification feature maps;
for each classification feature map, inputting the classification feature map into the bar-shaped attention network to obtain an attention feature map corresponding to the classification feature map, and adding the classification feature map and the attention feature map to obtain a semantic feature map corresponding to the classification feature map;
and determining a semantic segmentation result according to the plurality of semantic feature maps.
As a further improved technical solution, the enhanced pyramid network includes: the system comprises a first pooling pyramid module, a second pooling pyramid module, a third pooling pyramid module, a fourth pooling pyramid module, a first global average pooling module and a second global average pooling module; inputting the lightweight feature map into the enhanced pyramid network to obtain a spliced feature map, specifically comprising:
respectively inputting the lightweight feature map into the first pooling pyramid module, the second pooling pyramid module, the third pooling pyramid module, the fourth pooling pyramid module, the first global average pooling module and the second global average pooling module, obtaining a first feature map through the first pooling pyramid module, obtaining a second feature map through the second pooling pyramid module, obtaining a third feature map through the third pooling pyramid module, obtaining a fourth feature map through the fourth pooling pyramid module, obtaining a fifth feature map through the first global average pooling module, and obtaining a sixth feature map through the second global average pooling module;
and splicing the first feature map, the second feature map, the third feature map, the fourth feature map, the fifth feature map and the sixth feature map in a channel dimension to obtain a spliced feature map.
As a further improved technical solution, the first pooling pyramid module includes a 1 × 1 global average pooling layer and a first convolution layer; the second pooling pyramid module includes a 2 × 2 global average pooling layer and a second convolution layer; the third pooling pyramid module includes a 3 × 3 global average pooling layer and a third convolution layer; the fourth pooling pyramid module includes a 6 × 6 global average pooling layer and a fourth convolution layer; the first global average pooling module includes a 1 × None first pooling layer and a fifth convolution layer; and the second global average pooling module includes a None × 1 second pooling layer and a sixth convolution layer. Each of the first convolution layer, the second convolution layer, the third convolution layer, the fourth convolution layer, the fifth convolution layer, and the sixth convolution layer is a 1 × 1 convolution layer; the 1 × None first pooling layer performs average pooling over each row of the lightweight feature map, and the None × 1 second pooling layer performs average pooling over each column of the lightweight feature map.
As a further improved technical solution, the bar-shaped attention network includes: a first attention module and a second attention module; inputting the classification feature map into the bar-shaped attention network to obtain an attention feature map corresponding to the classification feature map, specifically comprising:
inputting the classification feature map into the first attention module to obtain a first bar feature map;
inputting the classification feature map into the second attention module to obtain a second bar feature map;
and multiplying the first bar feature map and the second bar feature map to obtain an attention feature map.
As a further improved technical solution, the first attention module includes a 1 × None third pooling layer and a 1 × 3 seventh convolution layer; the second attention module includes a None × 1 fourth pooling layer and a 3 × 1 eighth convolution layer, wherein the 1 × None third pooling layer performs average pooling over each row of the classification feature map and the None × 1 fourth pooling layer performs average pooling over each column of the classification feature map.
As a further improved technical scheme, the semantic feature maps respectively correspond to different categories, and for each semantic feature map, the semantic feature map comprises the probability that each pixel point belongs to the category corresponding to the semantic feature map; the determining a semantic segmentation result according to the plurality of semantic feature maps specifically includes:
for each pixel point, determining the maximum probability corresponding to the pixel point in the multiple semantic feature maps, and taking the category identification corresponding to the maximum probability as the category identification of the pixel point;
and obtaining semantic segmentation results corresponding to the images to be processed according to the category identifications respectively corresponding to each pixel point.
In a second aspect, an embodiment of the present invention provides a method for generating a semantic segmentation model, where the method for generating the semantic segmentation model includes:
inputting a training image into an initial convolutional neural network, and outputting a plurality of training semantic feature maps through the initial convolutional neural network, wherein the training image is an image in a training set, the training set comprises a plurality of training groups, and each training group comprises a plurality of training images and a real image label corresponding to each training image;
determining a prediction result according to the plurality of training semantic feature maps, and determining an inter-class loss value and an intra-class loss value according to the plurality of training semantic feature maps;
determining an original loss value according to the prediction result and the real image label, and determining a total loss value based on the original loss value, the inter-class loss value and the intra-class loss value;
modifying the network parameters of the initial convolutional neural network based on the total loss value, and continuing to execute the step of inputting the training image into the initial convolutional neural network until the preset training condition of the initial convolutional neural network is met, so as to obtain a semantic segmentation model.
As a further improved technical solution, the plurality of training semantic feature maps respectively correspond to different categories, and for each training semantic feature map, the training semantic feature map includes a training probability that each pixel belongs to the category corresponding to the training semantic feature map;
determining an inter-class loss value and an intra-class loss value according to the plurality of training semantic feature maps, which specifically comprises:
for each training semantic feature map, performing global average pooling processing on the training semantic feature map to obtain a global average pooling result corresponding to the training semantic feature map;
$$\delta_1 = \sum_{i=1}^{N} \sum_{m=1}^{x} \sum_{n=1}^{y} \left\| C_i - P^i_{m,n} \right\|$$

$$\delta_2 = \sum_{i \neq j} \left\| C_i - C_j \right\|$$

wherein δ₁ is the intra-class loss value, P^i_{m,n} is the training probability of the pixel point with coordinates (m, n) in the i-th training semantic feature map, and C_i is the global average pooling result corresponding to the i-th training semantic feature map; there are N training semantic feature maps, and each training semantic feature map contains x × y pixel points. δ₂ is the inter-class loss value, and C_j is the global average pooling result corresponding to the j-th training semantic feature map.
In a third aspect, the present invention provides a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
inputting an image to be processed into the lightweight neural network to obtain a lightweight feature map;
inputting the lightweight feature map into the enhanced pyramid network to obtain a spliced feature map;
inputting the spliced feature map into the classification network to obtain a plurality of classification feature maps;
for each classification feature map, inputting the classification feature map into the bar-shaped attention network to obtain an attention feature map corresponding to the classification feature map, and adding the classification feature map and the attention feature map to obtain a semantic feature map corresponding to the classification feature map;
determining a semantic segmentation result according to a plurality of semantic feature maps;
or inputting a training image into an initial convolutional neural network, and outputting a plurality of training semantic feature maps through the initial convolutional neural network, wherein the training image is an image in a training set, the training set comprises a plurality of training groups, and each training group comprises a plurality of training images and a real image label corresponding to each training image;
determining a prediction result according to the multiple training semantic feature maps, and determining an inter-class loss value and an intra-class loss value according to the multiple training semantic feature maps;
determining an original loss value according to the prediction result and the real image label, and determining a total loss value based on the original loss value, the inter-class loss value and the intra-class loss value;
modifying the network parameters of the initial convolutional neural network based on the total loss value, and continuing to execute the step of inputting the training image into the initial convolutional neural network until the preset training condition of the initial convolutional neural network is met, so as to obtain a semantic segmentation model.
In a fourth aspect, the invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:
inputting an image to be processed into the lightweight neural network to obtain a lightweight feature map;
inputting the lightweight feature map into the enhanced pyramid network to obtain a spliced feature map;
inputting the spliced feature map into the classification network to obtain a plurality of classification feature maps;
for each classification feature map, inputting the classification feature map into the bar-shaped attention network to obtain an attention feature map corresponding to the classification feature map, and adding the classification feature map and the attention feature map to obtain a semantic feature map corresponding to the classification feature map;
determining a semantic segmentation result according to a plurality of semantic feature maps;
or inputting a training image into an initial convolutional neural network, and outputting a plurality of training semantic feature maps through the initial convolutional neural network, wherein the training image is an image in a training set, the training set comprises a plurality of training groups, and each training group comprises a plurality of training images and a real image label corresponding to each training image;
determining a prediction result according to the multiple training semantic feature maps, and determining an inter-class loss value and an intra-class loss value according to the multiple training semantic feature maps;
determining an original loss value according to the prediction result and the real image label, and determining a total loss value based on the original loss value, the inter-class loss value and the intra-class loss value;
modifying the network parameters of the initial convolutional neural network based on the total loss value, and continuing to execute the step of inputting the training image into the initial convolutional neural network until the preset training condition of the initial convolutional neural network is met, so as to obtain a semantic segmentation model.
Compared with the prior art, the embodiment of the invention has the following advantages:
In the embodiment of the invention, an image to be processed is input into the lightweight neural network to obtain a lightweight feature map; the lightweight feature map is input into the enhanced pyramid network to obtain an enhanced feature map, and the lightweight feature map and the enhanced feature map are spliced to obtain a spliced feature map; the spliced feature map is input into the classification network to obtain a plurality of classification feature maps; for each classification feature map, the classification feature map is input into the bar-shaped attention network to obtain an attention feature map corresponding to the classification feature map, and the classification feature map and the attention feature map are added to obtain a semantic feature map corresponding to the classification feature map; and a semantic segmentation result is determined according to the plurality of semantic feature maps. The invention adopts a lightweight neural network to reduce the number of parameters used for computation, and improves the precision of image semantic segmentation through the enhanced pyramid network and the bar-shaped attention network; that is, the precision of image semantic segmentation is guaranteed while the amount of computation is reduced, making the method suitable for terminals with limited hardware resources.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a semantic segmentation method for images according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a standard convolution according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a separable convolution according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a structure of an enhanced pyramid network according to an embodiment of the present invention;
FIG. 5 is a schematic flow chart illustrating a semantic feature map obtained from the classification feature map and the bar-shaped attention network according to the embodiment of the present invention;
FIG. 6 is a flow chart illustrating a method for semantic segmentation of images in accordance with one implementation of the present invention;
FIG. 7 is a schematic flow chart illustrating a method for generating a semantic segmentation model according to an embodiment of the present invention;
fig. 8 is a schematic diagram of an internal structure of a computer device according to an embodiment of the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The inventor finds that, against the background of deep learning's development, convolutional neural networks have been accepted by more and more people and applied more and more widely, and image semantic segmentation is widely used in fields such as automatic driving, medical image diagnosis, and remote sensing image analysis. However, the general trend in deep learning is to achieve higher accuracy through deeper and more complex networks, and such networks are generally at a disadvantage in model size and running speed. In real life there are many terminals with limited hardware resources on which image semantic segmentation cannot be applied, which limits the development and application of image semantic segmentation.
In order to solve the above problem, in the embodiment of the present invention, an image to be processed is input into the lightweight neural network to obtain a lightweight feature map; the lightweight feature map is input into the enhanced pyramid network to obtain an enhanced feature map, and the lightweight feature map and the enhanced feature map are spliced to obtain a spliced feature map; the spliced feature map is input into the classification network to obtain a plurality of classification feature maps; for each classification feature map, the classification feature map is input into the bar-shaped attention network to obtain an attention feature map corresponding to the classification feature map, and the classification feature map and the attention feature map are added to obtain a semantic feature map corresponding to the classification feature map; and a semantic segmentation result is determined according to the plurality of semantic feature maps. The invention adopts a lightweight neural network to reduce the number of parameters used for computation, and improves the precision of image semantic segmentation through the enhanced pyramid network and the bar-shaped attention network; that is, the precision of image semantic segmentation is guaranteed while the amount of computation is reduced, making the method suitable for terminals with limited hardware resources.
The image semantic segmentation method provided by the invention can be applied to electronic equipment, wherein the electronic equipment comprises a terminal with limited hardware resources and limited calculation amount, and the electronic equipment can be realized in various forms, such as a PC (Personal computer), a server, a mobile phone, a tablet Personal computer, a palm computer, a Personal Digital Assistant (PDA) and the like. In addition, the functions implemented by the method can be implemented by calling program code by a processor in an electronic device, and the program code can be stored in a computer storage medium.
Various non-limiting embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Referring to fig. 1, an image semantic segmentation method in an embodiment of the present invention is shown, which is applied to a semantic segmentation model, where the semantic segmentation model includes a lightweight neural network, an enhanced pyramid network, a classification network, and a bar-shaped attention network; the image semantic segmentation method comprises the following steps:
S1, inputting the image to be processed into the lightweight neural network to obtain a lightweight feature map.
In the embodiment of the present invention, the image to be processed is the image on which semantic segmentation is performed to obtain a semantic segmentation result. The lightweight neural network may be MobileNet v2, a lightweight CNN designed for mobile terminals and embedded devices that reduces the number of model parameters while maintaining model performance through depthwise separable convolution. Referring to fig. 2 and fig. 3, the difference between standard convolution and separable convolution can be seen: separable convolution factors a standard convolution into a depthwise convolution, which applies one convolution kernel to each channel, and a pointwise convolution (with a 1 × 1 kernel), which combines the outputs of the per-channel convolutions, effectively reducing the amount of computation and the size of the model.
In the embodiment of the invention, the image to be processed is input into the MobileNet v2, and the MobileNet v2 outputs the lightweight feature map.
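For illustration, the following is a minimal PyTorch sketch of the depthwise separable convolution described above; the module name, channel sizes, and the omission of batch normalization and activation are simplifications, not details taken from the patent:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """A standard convolution factored into a depthwise convolution
    (one kernel per input channel, groups=in_ch) followed by a 1 x 1
    pointwise convolution that combines the per-channel outputs."""
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

# A 3x3 depthwise + 1x1 pointwise pair needs roughly 1/8 to 1/9 of the
# multiply-adds of a full 3x3 convolution at the same channel widths,
# which is the saving MobileNet v2 exploits.
block = DepthwiseSeparableConv(32, 64)
print(block(torch.rand(1, 32, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```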
S2, inputting the lightweight feature map into the enhanced pyramid network to obtain a spliced feature map.
In the embodiment of the invention, the enhanced pyramid network effectively fuses multi-scale image information: the lightweight feature map is pooled at different scales to form a spatial pyramid structure that captures context information, and the pooled results are then spliced to obtain the spliced feature map.
Specifically, referring to fig. 4, the enhanced pyramid network includes: a first pooling pyramid module 101, a second pooling pyramid module 102, a third pooling pyramid module 103, a fourth pooling pyramid module 104, a first global average pooling module 105, and a second global average pooling module 106; step S2 includes:
S21, inputting the lightweight feature map into the first pooling pyramid module, the second pooling pyramid module, the third pooling pyramid module, the fourth pooling pyramid module, the first global average pooling module and the second global average pooling module, respectively; the first feature map is obtained through the first pooling pyramid module, the second feature map through the second pooling pyramid module, the third feature map through the third pooling pyramid module, the fourth feature map through the fourth pooling pyramid module, the fifth feature map through the first global average pooling module, and the sixth feature map through the second global average pooling module.
In an embodiment of the present invention, the first pooling pyramid module includes a 1 × 1 global average pooling layer and a first convolution layer; the second pooling pyramid module includes a 2 × 2 global average pooling layer and a second convolution layer; the third pooling pyramid module includes a 3 × 3 global average pooling layer and a third convolution layer; the fourth pooling pyramid module includes a 6 × 6 global average pooling layer and a fourth convolution layer; the first global average pooling module includes a 1 × None first pooling layer and a fifth convolution layer; and the second global average pooling module includes a None × 1 second pooling layer and a sixth convolution layer. Each of the first convolution layer, the second convolution layer, the third convolution layer, the fourth convolution layer, the fifth convolution layer, and the sixth convolution layer is a 1 × 1 convolution layer; the 1 × None first pooling layer performs average pooling over each row of the lightweight feature map, and the None × 1 second pooling layer performs average pooling over each column of the lightweight feature map. The 1 × 1 convolution layer is a 2D convolution layer used to reduce the number of channels.
In the embodiment of the invention, the lightweight feature map is input into the 1 × 1 global average pooling layer, and the output of the 1 × 1 global average pooling layer is input into the first convolution layer to obtain the first feature map; the lightweight feature map is input into the 2 × 2 global average pooling layer, and the output of the 2 × 2 global average pooling layer is input into the second convolution layer to obtain the second feature map; the lightweight feature map is input into the 3 × 3 global average pooling layer, and the output of the 3 × 3 global average pooling layer is input into the third convolution layer to obtain the third feature map; the lightweight feature map is input into the 6 × 6 global average pooling layer, and the output of the 6 × 6 global average pooling layer is input into the fourth convolution layer to obtain the fourth feature map; the lightweight feature map is input into the 1 × None first pooling layer, and the output of the 1 × None first pooling layer is input into the fifth convolution layer to obtain the fifth feature map; and the lightweight feature map is input into the None × 1 second pooling layer, and the output of the None × 1 second pooling layer is input into the sixth convolution layer to obtain the sixth feature map. The first pooling layer and the second pooling layer are used for global average pooling: the 1 × None first pooling layer performs global average pooling over each row of the lightweight feature map, and the None × 1 second pooling layer performs global average pooling over each column.
S22, determining a splicing feature map based on the first feature map, the second feature map, the third feature map, the fourth feature map, the fifth feature map and the sixth feature map.
In the embodiment of the present invention, the first feature map, the second feature map, the third feature map, the fourth feature map, the fifth feature map, and the sixth feature map are spliced in a channel dimension to obtain a spliced feature map.
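A hedged PyTorch sketch of this module follows. The class name and per-branch channel width are illustrative; the bilinear resize before concatenation is an assumption, since the patent only states that the six feature maps are spliced in the channel dimension, which requires a common spatial size:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EnhancedPyramidNetwork(nn.Module):
    """Four pooling-pyramid branches (1x1, 2x2, 3x3, 6x6 adaptive
    average pooling, each followed by a 1x1 convolution) plus two
    strip branches: a 1 x None pool over each row and a None x 1 pool
    over each column, each also followed by a 1x1 convolution."""
    def __init__(self, in_ch: int, branch_ch: int):
        super().__init__()
        self.pyramids = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(s),
                          nn.Conv2d(in_ch, branch_ch, 1))
            for s in (1, 2, 3, 6))
        self.row_pool = nn.Sequential(nn.AdaptiveAvgPool2d((None, 1)),
                                      nn.Conv2d(in_ch, branch_ch, 1))
        self.col_pool = nn.Sequential(nn.AdaptiveAvgPool2d((1, None)),
                                      nn.Conv2d(in_ch, branch_ch, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[2:]
        branches = [m(x) for m in self.pyramids]
        branches += [self.row_pool(x), self.col_pool(x)]
        # Resize all six branches back to the input resolution so they
        # can be spliced along the channel dimension (assumption).
        branches = [F.interpolate(b, size=(h, w), mode='bilinear',
                                  align_corners=False) for b in branches]
        return torch.cat(branches, dim=1)    # spliced feature map
```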
S3, inputting the spliced feature map into the classification network to obtain a plurality of classification feature maps.
In the embodiment of the invention, the spliced feature map is classified through the classification network to obtain a classification feature map corresponding to each category. In a specific implementation, the classification network may map the spliced feature map to 21 categories, producing one classification feature map per category.
S4, for each classification feature map, inputting the classification feature map into the bar-shaped attention network to obtain an attention feature map corresponding to the classification feature map, and adding the classification feature map and the attention feature map to obtain a semantic feature map corresponding to the classification feature map.
In an embodiment of the present invention, referring to fig. 5, the bar attention network includes: a first attention module and a second attention module; specifically, for each classification feature map P _ f, inputting the classification feature map P _ f into the bar-shaped attention network 200, and obtaining the attention feature map corresponding to the classification feature map includes: inputting the classification feature map into the first attention module to obtain a first bar-shaped feature map; inputting the classification feature map into the second attention module to obtain a second bar feature map; and multiplying the first bar feature map and the second bar feature map to obtain an attention feature map P _ t.
In an embodiment of the present invention, the first attention module includes the 1 × None third pooling layer 201 and the 1 × 3 seventh convolution layer 202; the second attention module includes the None × 1 fourth pooling layer 203 and the 3 × 1 eighth convolution layer 204, wherein the 1 × None third pooling layer performs average pooling over each row of the classification feature map and the None × 1 fourth pooling layer performs average pooling over each column of the classification feature map.
In the embodiment of the invention, the classification feature map is input into the 1 × None third pooling layer, and the output of the third pooling layer is input into the seventh convolution layer to obtain the first bar feature map, which is a bar-shaped output of 1 × N; the classification feature map is input into the None × 1 fourth pooling layer, and the output of the fourth pooling layer is input into the eighth convolution layer to obtain the second bar feature map, which is a bar-shaped output of N × 1.
In the embodiment of the present invention, the first bar feature map and the second bar feature map are multiplied, specifically, matrix multiplication is performed to obtain an attention matrix of N × N, and the attention matrix of N × N is recorded as the attention feature map.
In the embodiment of the present invention, after obtaining the attention feature map corresponding to the classification feature map, the attention feature map and the classification feature map are added to obtain the semantic feature map P _ m corresponding to the classification feature map.
The addition of the classification feature map and the attention feature map is pixel-level addition. Specifically, the attention feature map includes probabilities corresponding to a plurality of attention pixel points, and the classification feature map includes probabilities corresponding to a plurality of classification pixel points, where the probability of each pixel point in the classification feature map indicates the probability that the pixel point belongs to the category corresponding to the classification feature map, and each pixel point in the attention feature map is a fusion of row and column information in the corresponding classification layer.
The attention pixel points correspond one-to-one to the classification pixel points; for each attention pixel point, its value is added to the value of the corresponding classification pixel point, yielding a feature map in which every pixel point of the classification layer is fused with its corresponding row and column information.
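The following is a sketch of the bar-shaped attention network under the stated 1 × N and N × 1 strip shapes; the grouped convolutions (one group per channel) reflect the fact that the patent applies the attention to each classification feature map separately, and all names are illustrative:

```python
import torch
import torch.nn as nn

class StripAttention(nn.Module):
    """Bar-shaped attention: one branch pools the classification map
    into a 1 x N horizontal strip and applies a 1x3 convolution along
    it; the other pools into an N x 1 vertical strip with a 3x1
    convolution. Their matrix product is an attention map that fuses
    row and column context and is added back at pixel level
    (P_m = P_f + attention)."""
    def __init__(self, channels: int = 1):
        super().__init__()
        self.h_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d((1, None)),          # B x C x 1 x W
            nn.Conv2d(channels, channels, (1, 3), padding=(0, 1),
                      groups=channels))
        self.v_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d((None, 1)),          # B x C x H x 1
            nn.Conv2d(channels, channels, (3, 1), padding=(1, 0),
                      groups=channels))

    def forward(self, p_f: torch.Tensor) -> torch.Tensor:
        strip_h = self.h_branch(p_f)            # B x C x 1 x W
        strip_v = self.v_branch(p_f)            # B x C x H x 1
        attn = torch.matmul(strip_v, strip_h)   # outer product: B x C x H x W
        return p_f + attn                       # semantic feature map P_m
```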
S5, determining a semantic segmentation result according to the plurality of semantic feature maps.
In the embodiment of the present invention, before step S5, each semantic feature map is upsampled and replaced by its upsampled version; the upsampled semantic feature map has the same size as the image to be processed.
The semantic feature maps respectively correspond to different categories. For example, the classification network outputs a plurality of classification feature maps corresponding to a plurality of different categories; each semantic feature map is determined from one classification feature map, and the category corresponding to a semantic feature map is the category of the classification feature map from which it is determined.
In the embodiment of the invention, for each semantic feature map, the semantic feature map comprises the probability that each pixel point belongs to the category corresponding to the semantic feature map. For example, the value of any pixel point in the semantic feature map corresponding to the first category is the probability that the pixel point belongs to the first category.
Specifically, step S5 includes:
S51, for each pixel point, determining the maximum probability corresponding to the pixel point in the plurality of semantic feature maps, and taking the category identification corresponding to the maximum probability as the category identification of the pixel point.
In the embodiment of the invention, each semantic feature map includes a probability for each pixel point. The maximum probability refers to the maximum value among the probabilities of that pixel point across the semantic feature maps. The category identification reflects a category, and the category identifications of any two different categories are different; the category identification is represented by the numerical index corresponding to the maximum probability.
For example, assume there are 5 semantic feature maps, P1, P2, P3, P4, and P5, and 5 categories: a first category, a second category, a third category, a fourth category, and a fifth category, with category identifications 0, 1, 2, 3, and 4, respectively; the 5 semantic feature maps correspond one-to-one to the 5 categories. For the pixel point q1 = (x1, y1), assume that the probability value of q1 in P1 is 0.05, in P2 is 0.1, in P3 is 0.15, in P4 is 0.05, and in P5 is 0.65; the maximum probability is then 0.65, so the category identification of q1 is the one corresponding to P5, namely 4.
S52, obtaining the semantic segmentation result corresponding to the image to be processed according to the category identification corresponding to each pixel point.
In the embodiment of the present invention, step S51 yields the category identification corresponding to each pixel point, and the semantic segmentation result includes the category identification corresponding to each pixel point. For example, for the pixel point q1, the category identification of q1 is 4, so 4 is marked at the position of q1. The semantic segmentation result can be visualized as an image in which each category is given its own color; for example, if the color corresponding to 4 is red, all pixel points belonging to category 4 are marked red.
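Steps S51 and S52 amount to a per-pixel argmax over the category dimension. A toy sketch, assuming the upsampled semantic feature maps are stacked along one axis:

```python
import torch

# Five toy semantic feature maps (one per category), stacked along the
# class dimension -> shape (num_classes, H, W).
semantic_maps = torch.rand(5, 4, 4)

# S51: for each pixel point, the index of the map holding the maximum
# probability is taken as the category identification.
class_ids = torch.argmax(semantic_maps, dim=0)  # (H, W) tensor of ids 0..4

# S52: the id map is the semantic segmentation result; for display,
# each id can be mapped to a fixed color.
print(class_ids)
```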
For convenience of illustration, referring to fig. 6, in an implementation, when the lightweight neural network is MobileNet v2, the process of inputting the image to be processed into the semantic segmentation model to obtain the semantic segmentation result includes:
inputting an image to be processed into a MobileNet v2 to obtain a lightweight feature map, inputting the lightweight feature map into an enhanced pyramid network to obtain a spliced feature map, inputting the spliced feature map into a classification network to obtain a plurality of classification feature maps, obtaining a plurality of attention feature maps based on the plurality of classification feature maps and a bar-shaped attention network, obtaining a plurality of semantic feature maps based on the plurality of attention feature maps, and obtaining a semantic segmentation result according to the plurality of semantic feature maps.
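Reusing the sketches above, the whole pipeline of fig. 6 can be composed as follows; the backbone call, the branch width, and the 21-category head are assumptions for illustration, not details fixed by the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class SemanticSegmentationModel(nn.Module):
    """MobileNet v2 features -> enhanced pyramid network -> per-class
    classification maps -> bar-shaped attention -> semantic feature
    maps upsampled to the input size."""
    def __init__(self, num_classes: int = 21, branch_ch: int = 128):
        super().__init__()
        self.backbone = torchvision.models.mobilenet_v2(weights=None).features
        self.epn = EnhancedPyramidNetwork(1280, branch_ch)     # see above
        self.classifier = nn.Conv2d(6 * branch_ch, num_classes, 1)
        self.attention = StripAttention(channels=num_classes)  # see above

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(x)               # lightweight feature map
        spliced = self.epn(feats)              # spliced feature map
        class_maps = self.classifier(spliced)  # one map per category
        sem_maps = self.attention(class_maps)  # P_m = P_f + attention
        return F.interpolate(sem_maps, size=x.shape[2:],
                             mode='bilinear', align_corners=False)

# Per-pixel category identifications: model(x).argmax(dim=1)
```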
In the embodiment of the invention, the lightweight neural network MobileNet v2 is adopted to reduce the number of parameters used for computation, and the precision of image semantic segmentation is improved through the enhanced pyramid network and the bar-shaped attention network; that is, the invention reduces the amount of computation while guaranteeing the precision of image semantic segmentation, and is suitable for terminals with limited hardware resources.
Next, a training method of the semantic segmentation model is introduced, and the present invention further provides a generation method of the semantic segmentation model, referring to fig. 7, the generation method of the semantic segmentation model includes:
M1, inputting the training images into an initial convolutional neural network, and outputting a plurality of training semantic feature maps through the initial convolutional neural network.
In the embodiment of the invention, a training set is obtained in advance, wherein the training set comprises a plurality of training groups, and each training group comprises a plurality of training images and real image labels respectively corresponding to each training image; the real image label comprises a real category identification corresponding to each pixel point in the training image, and the real category identification is used for reflecting the real category corresponding to the pixel point.
In the embodiment of the invention, one training image is obtained from a training group each time and used as the input of the initial convolutional neural network; after one training iteration is finished, another training image is obtained from the training group as the next input. After all training images in the training group have been input once, a training image can be obtained from another training group as the input of the initial convolutional neural network.
In an embodiment of the present invention, the initial convolutional neural network includes an initial lightweight neural network, an initial enhanced pyramid network, an initial classification network, and an initial bar-shaped attention network. The initial lightweight neural network has the same network structure as the lightweight neural network, and its network parameters are initialized network parameters; the initial enhanced pyramid network has the same network structure as the enhanced pyramid network, and its network parameters are initialized network parameters; the initial classification network has the same network structure as the classification network, and its network parameters are initialized network parameters; and the initial bar-shaped attention network has the same network structure as the bar-shaped attention network, and its network parameters are initialized network parameters.
In an embodiment of the present invention, the process of inputting a training image into an initial convolutional neural network and outputting a plurality of training semantic feature maps through the initial convolutional neural network includes:
M11, inputting the training image into the initial lightweight neural network to obtain a training lightweight feature map;
M12, inputting the training lightweight feature map into the initial enhanced pyramid network to obtain a training spliced feature map;
M13, inputting the training spliced feature map into the initial classification network to obtain a plurality of training classification feature maps;
M14, for each training classification feature map, inputting the training classification feature map into the initial bar-shaped attention network to obtain a training attention feature map corresponding to the training classification feature map, and adding the training classification feature map and the training attention feature map to obtain a training semantic feature map corresponding to the training classification feature map.
Specifically, the process of step M11 to step M14 is the same as the process of obtaining multiple semantic feature maps based on the image to be processed, the lightweight neural network, the enhanced pyramid network, the classification network, and the bar attention network in step S1 to step S4, and further, for the process of step M11 to step M14, the descriptions of step S1 to step S4 may be referred to.
M2, determining a prediction result according to the multiple training semantic feature maps, and determining an inter-class loss value and an intra-class loss value according to the multiple training semantic feature maps.
In the embodiment of the invention, the training semantic feature maps correspond one-to-one to a plurality of categories, and any two different training semantic feature maps correspond to two different categories; each training semantic feature map includes the training probability that each pixel point belongs to the category corresponding to that training semantic feature map. For each pixel point, the maximum training probability corresponding to the pixel point across the training semantic feature maps is determined, and the training category identification corresponding to the maximum training probability is taken as the training category identification of the pixel point; the training category identifications of all pixel points are thereby obtained, and the prediction result is derived from them.
In this embodiment of the present invention, determining the inter-class loss value and the intra-class loss value according to the plurality of training semantic feature maps includes: for each training semantic feature map, performing global average pooling on the training semantic feature map to obtain the global average pooling result corresponding to that map, and determining the intra-class loss value according to formula (1):

$$\delta_1 = \sum_{i=1}^{N} \sum_{m=1}^{x} \sum_{n=1}^{y} \left\| C_i - P^i_{m,n} \right\| \tag{1}$$

where δ₁ is the intra-class loss value, P^i_{m,n} is the training probability of the pixel point with coordinates (m, n) in the i-th training semantic feature map, and C_i is the global average pooling result corresponding to the i-th training semantic feature map; there are N training semantic feature maps, and each contains x × y pixel points. Formula (1) means that, for each training semantic feature map, a first difference between the global average pooling result of the map and each training probability in the map is determined, and the first differences over all training semantic feature maps are summed to obtain the intra-class loss value. The intra-class loss value is minimized during training of the convolutional neural network.
The inter-class loss value is determined according to formula (2):

$$\delta_2 = \sum_{i \neq j} \left\| C_i - C_j \right\| \tag{2}$$

where δ₂ is the inter-class loss value, C_i is the global average pooling result corresponding to the i-th training semantic feature map, and C_j is the global average pooling result corresponding to the j-th training semantic feature map. Formula (2) means that a second difference between any two global average pooling results is determined, the modulus of each second difference is taken, and the moduli of all second differences are summed to obtain the inter-class loss value. The inter-class loss value is maximized during training of the convolutional neural network.
M3, determining an original loss value according to the prediction result and the real image label, and determining a total loss value based on the original loss value, the inter-class loss value and the intra-class loss value.
In the embodiment of the invention, a cross-entropy loss function is used to determine the original loss value δ₃ from the prediction result and the real image label, and the total loss value is then determined according to formula (3):

$$loss = \delta_3 + \delta_1 - \delta_2 \tag{3}$$

where loss is the total loss value, δ₃ is the original loss value, δ₁ is the intra-class loss value, and δ₂ is the inter-class loss value.
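A hedged PyTorch sketch of formulas (1)-(3) is given below; treating the semantic maps as class scores for the cross-entropy term, using the absolute value as the norm of the scalar pooled results, and counting each unordered (i, j) pair twice in δ₂ (which only rescales that term) are assumptions not fixed by the patent text:

```python
import torch
import torch.nn.functional as F

def total_loss(sem_maps: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """sem_maps: (B, N, H, W) training semantic feature maps, one per
    category; target: (B, H, W) ground-truth category ids.
    Returns loss = delta3 + delta1 - delta2 as in formula (3)."""
    # C_i: global average pooling result of each training semantic map
    centers = sem_maps.mean(dim=(2, 3))                       # (B, N)
    # Formula (1): sum of ||C_i - P_i(m, n)|| over every pixel of
    # every map (intra-class loss, minimized during training).
    delta1 = (sem_maps - centers[:, :, None, None]).abs().sum()
    # Formula (2): sum of ||C_i - C_j|| over pairs of pooled results
    # (inter-class loss, maximized during training).
    delta2 = (centers[:, :, None] - centers[:, None, :]).abs().sum()
    # delta3: original cross-entropy loss between prediction and label.
    delta3 = F.cross_entropy(sem_maps, target)
    return delta3 + delta1 - delta2
```

In practice the three terms would usually be balanced with weighting coefficients; formula (3) as written uses the plain sum.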
M4, modifying the network parameters of the initial convolutional neural network based on the total loss value, and continuing to execute the step of inputting the training image into the initial convolutional neural network until the preset training condition of the initial convolutional neural network is met, so as to obtain a semantic segmentation model.
In the embodiment of the present invention, the network parameters of the initial convolutional neural network are modified based on the total loss value; specifically, the network parameters of the initial lightweight neural network, the initial enhanced pyramid network, the initial classification network, and the initial bar-shaped attention network are modified based on the total loss value. All pictures in the data set are grouped and put into the network for training in sequence, the weights of the network are modified through gradient back-propagation, and one round of training is completed once every picture in the data set has been input once. Iterative training is performed until the preset training condition of the initial convolutional neural network is met; the preset training condition may be convergence of the initial convolutional neural network, or the number of training iterations reaching a preset value, for example 5000. The semantic segmentation model is obtained after training.
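A schematic training loop matching this description, assuming model is the composed network sketched earlier, train_loader yields the grouped training images with their real image labels, and total_loss is the function above; the optimizer and learning rate are placeholders:

```python
import torch

model = SemanticSegmentationModel()                 # sketched above
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

for training_round in range(5000):       # until the preset condition is met
    for images, labels in train_loader:  # one pass over all groups = 1 round
        sem_maps = model(images)             # training semantic feature maps
        loss = total_loss(sem_maps, labels)  # formula (3) total loss
        optimizer.zero_grad()
        loss.backward()                      # gradient back-propagation
        optimizer.step()                     # modify the network parameters
```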
In the embodiment of the invention, the semantic segmentation model comprises a lightweight neural network, an enhanced pyramid network, a classification network and a bar-shaped attention network, wherein the lightweight neural network, the enhanced pyramid network, the classification network and the bar-shaped attention network are trained networks. That is, steps M1 through M4 are used to generate the semantic segmentation models described in steps S1 through S5.
The semantic segmentation method provided by the invention was tested on the VOC 2012 standard data set and the CamVid data set, comparing PA (pixel accuracy), mIoU (mean intersection over union), parameter count, and training time against networks with the same number of layers. The test results are shown in Table 1 and Table 2.
Table 1 VOC 2012 experimental results

Method | Backbone | Parameter count | mIoU (%) | PA (%)
PSPNet | ResNet50 | 392 MB | 70.8 | 92
DeepLabv3 | ResNet50 | 309 MB | 71.8 | 92.3
Semantic segmentation model | MobileNet v2 | 18.7 MB | 71.7 | 92.5
The semantic segmentation model requires far fewer parameters than PSPNet and DeepLabv3+, and also achieves a better mIoU than PSPNet.
Table 2 CamVid experimental results

Method | Backbone | Training time | mIoU (%) | PA (%)
PSPNet | ResNet50 | 12.1 h | 53.8 | 86.4
PSPNet | MobileNet v2 | 2.9 h | 51.7 | 83.5
Semantic segmentation model | MobileNet v2 | 3.4 h | 52.6 | 84.8
On the CamVid data set the comparison focuses mainly on training time; it can be seen that, compared with PSPNet on the same MobileNet v2 backbone, the semantic segmentation model takes slightly longer to train but improves the pixel accuracy.
The embodiment of the invention also provides computer equipment which can be a terminal, and the internal structure of the computer equipment is shown in figure 8. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operating system and the computer program to run on the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of semantic segmentation of images or a method of generation of a semantic segmentation model. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on a shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the structure shown in fig. 8 is merely a block diagram of part of the structure related to the disclosed solution and does not constitute a limitation on the computer device to which the solution is applied; a particular computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
The embodiment of the present invention further provides a computer device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the following steps when executing the computer program:
inputting an image to be processed into the lightweight neural network to obtain a lightweight feature map;
inputting the lightweight feature map into the enhanced pyramid network to obtain a spliced feature map;
inputting the spliced feature map into the classification network to obtain a plurality of classification feature maps;
for each classification feature map, inputting the classification feature map into the bar-shaped attention network to obtain an attention feature map corresponding to the classification feature map, and adding the classification feature map and the attention feature map to obtain a semantic feature map corresponding to the classification feature map;
determining a semantic segmentation result according to a plurality of semantic feature maps;
or inputting a training image into an initial convolutional neural network, and outputting a plurality of training semantic feature maps through the initial convolutional neural network, wherein the training image is an image in a training set, the training set comprises a plurality of training groups, and each training group comprises a plurality of training images and a real image label corresponding to each training image;
determining a prediction result according to the multiple training semantic feature maps, and determining an inter-class loss value and an intra-class loss value according to the multiple training semantic feature maps;
determining an original loss value according to the prediction result and the real image label, and determining a total loss value based on the original loss value, the inter-class loss value and the intra-class loss value;
modifying the network parameters of the initial convolutional neural network based on the total loss value, and continuing to execute the step of inputting the training image into the initial convolutional neural network until the preset training condition of the initial convolutional neural network is met, so as to obtain a semantic segmentation model.
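Read together, the inference steps above describe a fixed data flow through the four sub-networks. The sketch below shows that flow only; the four modules are passed in as stand-ins (assumptions), and applying the bar-shaped attention to all classification feature maps at once, rather than one map at a time, is a simplifying assumption.

```python
# Data-flow sketch of the inference steps above; the sub-networks are stand-ins.
import torch.nn as nn

class SemanticSegmentation(nn.Module):
    def __init__(self, lightweight, pyramid, classifier, bar_attention):
        super().__init__()
        self.lightweight = lightweight      # lightweight neural network
        self.pyramid = pyramid              # enhanced pyramid network
        self.classifier = classifier        # classification network
        self.bar_attention = bar_attention  # bar-shaped attention network

    def forward(self, image):
        light = self.lightweight(image)        # lightweight feature map
        spliced = self.pyramid(light)          # spliced feature map
        class_maps = self.classifier(spliced)  # one classification feature map per category
        # semantic feature map = classification feature map + its attention feature map
        semantic_maps = class_maps + self.bar_attention(class_maps)
        return semantic_maps.argmax(dim=1)     # per-pixel category identification
```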
An embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the following steps:
inputting an image to be processed into the lightweight neural network to obtain a lightweight feature map;
inputting the lightweight feature map into the enhanced pyramid network to obtain a spliced feature map;
inputting the spliced feature map into the classification network to obtain a plurality of classification feature maps;
for each classification feature map, inputting the classification feature map into the bar-shaped attention network to obtain an attention feature map corresponding to the classification feature map, and adding the classification feature map and the attention feature map to obtain a semantic feature map corresponding to the classification feature map;
determining a semantic segmentation result according to a plurality of semantic feature maps;
or inputting a training image into an initial convolutional neural network, and outputting a plurality of training semantic feature maps through the initial convolutional neural network, wherein the training image is an image in a training set, the training set comprises a plurality of training groups, and each training group comprises a plurality of training images and a real image label corresponding to each training image;
determining a prediction result according to the plurality of training semantic feature maps, and determining an inter-class loss value and an intra-class loss value according to the plurality of training semantic feature maps;
determining an original loss value according to the prediction result and the real image label, and determining a total loss value based on the original loss value, the inter-class loss value and the intra-class loss value;
modifying the network parameters of the initial convolutional neural network based on the total loss value, and continuing to execute the step of inputting the training image into the initial convolutional neural network until the preset training condition of the initial convolutional neural network is met, so as to obtain a semantic segmentation model.
For the sake of brevity, not every possible combination of the technical features of the above embodiments is described; nevertheless, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of the present disclosure.

Claims (8)

1. The image semantic segmentation method is characterized by being applied to a semantic segmentation model, wherein the semantic segmentation model comprises a lightweight neural network, an enhanced pyramid network, a classification network and a bar-shaped attention network; the image semantic segmentation method comprises the following steps:
inputting an image to be processed into the lightweight neural network to obtain a lightweight feature map;
inputting the lightweight feature map into the enhanced pyramid network to obtain a spliced feature map;
inputting the spliced feature map into the classification network to obtain a plurality of classification feature maps;
for each classification feature map, inputting the classification feature map into the bar-shaped attention network to obtain an attention feature map corresponding to the classification feature map, and adding the classification feature map and the attention feature map to obtain a semantic feature map corresponding to the classification feature map;
determining a semantic segmentation result according to a plurality of semantic feature maps;
the enhanced pyramid network includes: the system comprises a first pooling pyramid module, a second pooling pyramid module, a third pooling pyramid module, a fourth pooling pyramid module, a first global average pooling module and a second global average pooling module; inputting the lightweight feature map into the enhanced pyramid network to obtain a spliced feature map, specifically comprising:
respectively inputting the lightweight feature map into the first pooling pyramid module, the second pooling pyramid module, the third pooling pyramid module, the fourth pooling pyramid module, the first global average pooling module and the second global average pooling module, obtaining a first feature map through the first pooling pyramid module, obtaining a second feature map through the second pooling pyramid module, obtaining a third feature map through the third pooling pyramid module, obtaining a fourth feature map through the fourth pooling pyramid module, obtaining a fifth feature map through the first global average pooling module, and obtaining a sixth feature map through the second global average pooling module;
splicing the first feature map, the second feature map, the third feature map, the fourth feature map, the fifth feature map and the sixth feature map in a channel dimension to obtain a spliced feature map;
the bar attention network comprises: a first attention module and a second attention module; inputting the classification feature map into the bar-shaped attention network to obtain an attention feature map corresponding to the classification feature map, specifically comprising:
inputting the classification feature map into the first attention module to obtain a first bar feature map;
inputting the classification feature map into the second attention module to obtain a second bar feature map;
and multiplying the first bar feature map and the second bar feature map to obtain an attention feature map.
2. The semantic segmentation method of claim 1, wherein the first pooling pyramid module comprises a 1×1 global average pooling layer and a first convolution layer; the second pooling pyramid module comprises a 2×2 global average pooling layer and a second convolution layer; the third pooling pyramid module comprises a 3×3 global average pooling layer and a third convolution layer; the fourth pooling pyramid module comprises a 6×6 global average pooling layer and a fourth convolution layer; the first global average pooling module comprises a 1×None first pooling layer and a fifth convolution layer; and the second global average pooling module comprises a None×1 second pooling layer and a sixth convolution layer; wherein each of the first convolution layer, the second convolution layer, the third convolution layer, the fourth convolution layer, the fifth convolution layer, and the sixth convolution layer is a 1×1 convolution layer, the 1×None first pooling layer is for average pooling of each row of the lightweight feature map, and the None×1 second pooling layer is for average pooling of each column of the lightweight feature map.
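A sketch of the enhanced pyramid network as specified in claims 1 and 2 follows. Two assumptions are made for the example: the fixed-size global average pooling layers are read as adaptive average pooling to the stated output sizes (as in PSPNet), and each branch is bilinearly upsampled back to the input size before the channel-dimension splice, since maps of different sizes cannot be concatenated directly.

```python
# Sketch of the enhanced pyramid network of claims 1-2 (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F

class EnhancedPyramid(nn.Module):
    def __init__(self, in_ch: int, branch_ch: int):
        super().__init__()
        # four pooling pyramid modules: average pooling to 1x1, 2x2, 3x3, 6x6
        self.pyramid_pools = nn.ModuleList(nn.AdaptiveAvgPool2d(s) for s in (1, 2, 3, 6))
        self.row_pool = nn.AdaptiveAvgPool2d((None, 1))  # 1 x None: averages each row
        self.col_pool = nn.AdaptiveAvgPool2d((1, None))  # None x 1: averages each column
        # first to sixth convolution layers, all 1x1
        self.convs = nn.ModuleList(nn.Conv2d(in_ch, branch_ch, 1) for _ in range(6))

    def forward(self, x):
        h, w = x.shape[-2:]
        branches = [pool(x) for pool in self.pyramid_pools] + [self.row_pool(x), self.col_pool(x)]
        feats = [F.interpolate(conv(b), size=(h, w), mode="bilinear", align_corners=False)
                 for conv, b in zip(self.convs, branches)]
        return torch.cat(feats, dim=1)  # splice the six feature maps in the channel dimension
```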
3. The semantic segmentation method according to claim 1, wherein the first attention module comprises a 1×None third pooling layer and a 1×3 seventh convolution layer; the second attention module comprises a None×1 fourth pooling layer and a 3×1 eighth convolution layer; wherein the 1×None third pooling layer is for average pooling of each row of the classification feature map, and the None×1 fourth pooling layer is for average pooling of each column of the classification feature map.
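The bar-shaped attention of claims 1 and 3 can be sketched as below. One interpretive assumption: each small convolution is oriented along the axis that survives its pooling (a 3×1 kernel along the height of the row-pooled strip, a 1×3 kernel along the width of the column-pooled strip), and the two bar feature maps are multiplied by broadcasting into a full-size attention feature map; the caller then adds this map to the classification feature map as required by claim 1.

```python
# Sketch of the bar-shaped attention network of claims 1 and 3 (orientation assumed).
import torch.nn as nn

class BarAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.row_pool = nn.AdaptiveAvgPool2d((None, 1))  # average pooling of each row -> (H, 1)
        self.col_pool = nn.AdaptiveAvgPool2d((1, None))  # average pooling of each column -> (1, W)
        self.row_conv = nn.Conv2d(channels, channels, (3, 1), padding=(1, 0))  # along height
        self.col_conv = nn.Conv2d(channels, channels, (1, 3), padding=(0, 1))  # along width

    def forward(self, class_map):
        bar1 = self.row_conv(self.row_pool(class_map))  # first bar feature map, (B, C, H, 1)
        bar2 = self.col_conv(self.col_pool(class_map))  # second bar feature map, (B, C, 1, W)
        return bar1 * bar2  # broadcast multiply -> attention feature map (B, C, H, W)
```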
4. The semantic segmentation method according to claim 1, wherein the semantic feature maps respectively correspond to different categories, and for each semantic feature map, the semantic feature map comprises a probability that each pixel belongs to the category corresponding to the semantic feature map; the determining a semantic segmentation result according to the multiple semantic feature maps specifically includes:
for each pixel point, determining the maximum probability corresponding to the pixel point in the multiple semantic feature maps, and taking the category identification corresponding to the maximum probability as the category identification of the pixel point;
and obtaining semantic segmentation results corresponding to the images to be processed according to the category identifications respectively corresponding to each pixel point.
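The decision rule of claim 4 reduces to a per-pixel argmax over the stacked semantic feature maps; a short sketch follows, with the tensor layout an assumption.

```python
import torch

def segmentation_result(semantic_maps: torch.Tensor) -> torch.Tensor:
    # semantic_maps: (N, H, W), one probability map per category (assumed layout)
    return semantic_maps.argmax(dim=0)  # (H, W): category identification of each pixel
```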
5. A method for generating a semantic segmentation model according to claim 1, the method comprising:
inputting a training image into an initial convolutional neural network, and outputting a plurality of training semantic feature maps through the initial convolutional neural network, wherein the training image is an image in a training set, the training set comprises a plurality of training groups, and each training group comprises a plurality of training images and a real image label corresponding to each training image;
determining a prediction result according to the plurality of training semantic feature maps, and determining an inter-class loss value and an intra-class loss value according to the plurality of training semantic feature maps;
determining an original loss value according to the prediction result and the real image label, and determining a total loss value based on the original loss value, the inter-class loss value and the intra-class loss value;
modifying the network parameters of the initial convolutional neural network based on the total loss value, and continuing to execute the step of inputting the training image into the initial convolutional neural network until the preset training condition of the initial convolutional neural network is met, so as to obtain a semantic segmentation model.
6. The method for generating a semantic segmentation model according to claim 5, wherein the training semantic feature maps respectively correspond to different categories, and for each training semantic feature map, the training semantic feature map includes a training probability that each pixel belongs to the category corresponding to the training semantic feature map;
determining an inter-class loss value and an intra-class loss value according to the plurality of training semantic feature maps, which specifically comprises:
for each training semantic feature map, performing global average pooling processing on the training semantic feature map to obtain a global average pooling result corresponding to the training semantic feature map;
δ_1 = Σ_{i=1}^{N} Σ_{m=1}^{x} Σ_{n=1}^{y} || p_{mn}^{i} − C_i ||

δ_2 = Σ_{i≠j} || C_i − C_j ||

wherein δ_1 is the intra-class loss value, p_{mn}^{i} is the training probability of the pixel point with coordinates (m, n) in the i-th training semantic feature map, C_i is the global average pooling result corresponding to the i-th training semantic feature map, there are N training semantic feature maps, and each training semantic feature map comprises x × y pixel points; δ_2 is the inter-class loss value, and C_j is the global average pooling result corresponding to the j-th training semantic feature map.
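The losses of claim 6 can be sketched directly from the formulas above. Two assumptions: the norm is taken as an absolute difference of the scalar pooling results, and the final combination of the original, intra-class, and inter-class losses (the claims only say the total is determined "based on" all three) is shown as one plausible weighted form.

```python
# Sketch of the intra-class (delta_1) and inter-class (delta_2) losses of claim 6.
import torch

def class_losses(semantic_maps: torch.Tensor):
    # semantic_maps: (N, x, y) training semantic feature maps (assumed layout)
    n = semantic_maps.shape[0]
    centers = semantic_maps.mean(dim=(1, 2))  # C_i: global average pooling result, shape (N,)
    # delta_1: every pixel probability pulled toward its own map's center
    delta1 = (semantic_maps - centers.view(n, 1, 1)).abs().sum()
    # delta_2: separation between the centers of different categories (all pairs)
    delta2 = (centers.unsqueeze(0) - centers.unsqueeze(1)).abs().sum()
    return delta1, delta2

# One plausible (assumed) combination, penalizing small inter-class separation:
# total_loss = original_loss + alpha * delta1 - beta * delta2
```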
7. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor when executing the computer program implements the steps of the image semantic segmentation method of any one of claims 1 to 4 or the generation method of the semantic segmentation model of any one of claims 5 to 6.
8. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the image semantic segmentation method according to one of claims 1 to 4 or the generation method of the semantic segmentation model according to one of claims 5 to 6.
CN202110353991.1A 2021-04-01 2021-04-01 Image semantic segmentation method and computer equipment Active CN113159057B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202110353991.1A | 2021-04-01 | 2021-04-01 | Image semantic segmentation method and computer equipment

Publications (2)

Publication Number | Publication Date
CN113159057A | 2021-07-23
CN113159057B | 2022-09-02

Family

ID=76885932

Country Status (1)

Country | Link
CN | CN113159057B (en)

Citations (3)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
CN107644426A * | 2017-10-12 | 2018-01-30 | University of Science and Technology of China | Image semantic segmentation method based on pyramid pooling encoder-decoder structure
WO2019024808A1 * | 2017-08-01 | 2019-02-07 | Beijing SenseTime Technology Development Co., Ltd. | Training method and apparatus for semantic segmentation model, electronic device and storage medium
CN112465800A * | 2020-12-09 | 2021-03-09 | Beihang University | Instance segmentation method for correcting classification errors by using classification attention module

Family Cites Families (3)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
US11188799B2 * | 2018-11-12 | 2021-11-30 | Sony Corporation | Semantic segmentation with soft cross-entropy loss
CN112183414A * | 2020-09-29 | 2021-01-05 | Nanjing University of Information Science and Technology | Weakly supervised remote sensing object detection method based on hybrid dilated convolution
CN112365514A * | 2020-12-09 | 2021-02-12 | University of Science and Technology Liaoning | Semantic segmentation method based on improved PSPNet




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant