CN114494701A

CN114494701A - Semantic segmentation method and device based on graph structure neural network

Info

Publication number: CN114494701A
Application number: CN202210134177.5A
Authority: CN
Inventors: 胡浩基; 白健弘; 王化良; 龙永文; 欧阳涛; 黄源甲
Original assignee: Zhejiang University ZJU; Foshan Shunde Midea Electrical Heating Appliances Manufacturing Co Ltd
Current assignee: Zhejiang University ZJU; Foshan Shunde Midea Electrical Heating Appliances Manufacturing Co Ltd
Priority date: 2022-02-14
Filing date: 2022-02-14
Publication date: 2022-05-13

Abstract

The invention discloses a semantic segmentation method and a semantic segmentation device based on a graph structure neural network. A class semantic enhancement module (CSE) is proposed that uses graph models to create graph structures between channels and further outputs a "channel" - "object" relationship matrix to reconstruct a feature map. In addition, a full convolution network layer fusing the prior information of the object is created based on the similar semantic enhancement module, and a fine segmentation result is generated through rough segmentation and a feature map. On the basis of the two modules, a semantic enhancement network (CSENet) is proposed, and the network captures the interdependence and the relationship between 'channel' and 'object'. And a rough-to-fine segmentation strategy is used, and the CSENet layer and the CPFC layer are sequentially stacked, so that the rough segmentation result is gradually optimized, and the fine segmentation result is finally output as the final network output. Experiments show that the method can effectively improve the performance of the existing semantic segmentation network.

Description

Semantic segmentation method and device based on graph structure neural network

Technical Field

The invention relates to the fields of deep learning, semantic segmentation and the like, in particular to a semantic segmentation method and a semantic segmentation device based on a graph structure neural network.

Background

Semantic segmentation is a challenging basic task in computer vision, aimed at understanding and segmenting scenes. In the real world, objects in a scene are not independent, but interact to form a complex scene. Accurately capturing the interdependencies between objects helps to understand scene semantics, thus completing pixel-level segmentation in a scene.

Slave full convolution network^[1]FCN-based approaches have been the dominant solution since the advent (FCN). Recent efforts to employ multi-scale strategies have successfully exploited object context information. Deep Lab series method^[2][3][4]Convolution modules with different void rates are continuously explored to enlarge the receptive field and enhance the object context characteristics. PSPNet^[5]A global context is learned using a global average pool. As an extension and extension of the above method, some work^{[6][7][8][9][10]}By aggregating the feature maps at multiple scales, a wider range of semantic information is obtained. Although the multiscale approach broadens the receptive field, it results in partial local information loss and correlation between objects is neglected.

"relationship" based methods have performed well in recent years because they are not limited by the scope of the receptive field. They can be divided into two groups depending on the relationship dimension employed. One is pixel relation for studying imagesThe interaction between elements. The other is a regional relationship, which aims to study the characteristics of a specific region composed of certain pixels. Adaptive regions, a priori spatial distribution of objects, and per feature mapping of channels are common definitions of regions. For methods based on pixel relationships, DANet^[11]The correlation between pixels is studied and a coherent characterization of similar pixels is given. ACFNet^[12]The higher order relationships between pixels were further investigated with the aim of tracking the interdependencies between objects. Although Intersectssa^[13]A factorized pixel-level attention is proposed to reduce computational cost, but establishing correlations between pixels still significantly increases the complexity of the model.

Although region relationship-based methods can reduce the complexity of the model, the performance of these methods depends largely on the quality of region partitioning due to the complexity and variability of the image scene and semantics. The adaptive region partitioning method does not perform well because it lacks strong discrimination information about the target region. The attention module of DANet treats the channels as a characteristic representation of the object and enhances the representation of each channel by weighted summation between the channels. The implicit, uncopyable object region characteristics of the channel still face the problem of poor performance. Acnnet directly uses the coarse segmentation as an a priori spatial distribution to enhance the features of each pixel. However, each channel describes the object from various aspects (local or global) and is not limited to the spatial distribution of the object, which leads to inaccurate feature optimization.

[1].Long J，Shelhamer E，Darrell T.Fully convolutional networks for semantic segmentation[C]//Proceedings of the IEEE conference on computer vision and pattern recognition.2015：3431-3440.

[2].Chen L C，Papandreou G，Kokkinos l，et al.Deeplab：Semantic image segmentation with deep convolutional nets，atrous convolution，and fully connected crfs[J].IEEE transactions on pattern analysis and machine intelligence，2017，40(4)：834-848.

[3].Chen L C，Zhu Y，Papandreou G，et al.Encoder-decoder with atrous separable convolution for semantic image segmentation[C]//Proceedings of the European conference on computer vision(ECCV).2018：801-818.

[4].Chen L C，Papandreou G，Schroff F，et al.Rethinking atrous convolution for semantic image segmentation[J]arXiv preprint arXiv：1706.05587，2017.

[5].Zhao H，Shi J，Qi X，et al.Pyramid scene parsing network[C]//Proceedings of the lEEE conference on computer vision and pattern recognition.2017：2881-2890.

[6].Zhang H，Dana K，Shi J，et al.Context encoding for semantic segmentation[C]//Proceedings of the IEEE conference on Computer Vision and Pattern Recognition.2018：7151-7160.

[7].Yang M，Yu K，Zhang C，et al.Denseaspp for sementic segmentation in street scenes[C]//Proceedings of the lEEE conference on computer Vision and pattern recognition.2018：3684-3692.

[8].He J，Deng Z，Zhou L，et al.Adaptive pyramid context network for semantic segmentation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision end Pattern Recognition.2019：7519-7528.

[9].Lin D，Shen D，Shen S，et a1.Zigzagnet：Fusing top-down and bottom-up context for object segmentation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2019：7490-7499.

[10].Fu J，Liu J，Tian H，et al.Dual attention network for scene segmentation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2019：3146-3154.

[11].Zhang H，Zhang H，Wang C，et al.Co-occurrent features in semantic segmentation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2019：548-557.

[12].Huang L，Yuan Y，Guo J，et al.Interlaced sparse self-attention for semantic segmentation[J].arXiv preprint arXiv：1907.12273，2019.

[13].He K，Zhang X，Ren S，et al.Deep residual learning for image recognition[C]//Proceedings of the IEEE conference on computer vision and pattern recognition.2016：770-778.

Disclosure of Invention

The method is mainly applied to solving the semantic segmentation problem in a scene, namely, the correct class label is distributed to each pixel in the input picture. Semantic segmentation has a very wide application scene in the industry, and relates to medical image segmentation, automatic driving, face recognition, geological detection and the like.

Aiming at the defects of the prior art and solving the problems of inaccurate characteristic representation and higher complexity of the prior art, the invention provides a semantic segmentation method and a semantic segmentation device based on a graph structure neural network.

Firstly, a Class Semantic Enhancement module (CSE) is constructed by using a graph attention mechanism, and the module creates graph structures among channels and further outputs a more accurate ' channel ' -object ' similarity matrix to reconstruct distinguishable Class Semantic representations. Secondly, a full convolution network layer (CPFC) containing object prior information is constructed based on a Class semantic enhancement module, and the CPFC generates a more refined segmentation result through a rough segmentation result and a reconstructed feature map. On the basis of the two modules, a semantic enhancement network (CSENet) is formed, the network can capture the interdependence between different dimensions of the feature map and the prior information between a channel and an object, and therefore the final segmentation result can be generated.

The purpose of the invention can be realized by the following technical method: a semantic segmentation method of a neural network based on a graph structure comprises the following steps:

(1) acquiring an image to be segmented, and inputting the image into a residual error network for feature extraction; generating a primary segmentation result from the extracted feature map through a convolution network;

(2) inputting the preliminary segmentation result into a plurality of similar semantic enhancement networks which are connected in sequence, and gradually refining the segmentation result, wherein each similar semantic enhancement network comprises a similar semantic enhancement module and a full convolution network layer fusing object prior information; the quasi-semantic enhancement module takes the generated feature map and the segmentation result of the last quasi-semantic enhancement network as input, outputs a new feature map and a joint probability density matrix to a full convolution network layer of the prior information of the fusion object, and further outputs the segmentation result;

(3) and restoring the segmentation result after the last class semantic enhancement network is refined into the original resolution of the input image, and taking the class with the highest confidence as the final class of each pixel to obtain the image after semantic segmentation.

Further, a similar semantic enhancement module in the similar semantic enhancement network converts a feature map and a segmentation result of the previous semantic enhancement network into matrixes and multiplies the matrixes to obtain an object-channel relation matrix, linear mapping is performed on the relation matrix by using k learnable parameters, cosine similarity is calculated in pairs for each mapped dimensional vector to form k adjacent matrixes A^k(k＝1，2，3)，A^kElement A in (A)_i，jThe value of (a) can represent the correlation degree between a characteristic diagram channel i and a characteristic diagram channel j, and the larger the value of (b), the more compact the correlation of the two channels; interacting and aggregating semantic information on the dimension with close association by using a graph neural network to generate a joint probability density matrix; and multiplying the relation matrix and the joint probability density matrix element by element, then multiplying the multiplied relation matrix and the joint probability density matrix by a matrix converted by the segmentation result to obtain a reconstructed feature map, fusing the reconstructed feature map with the feature map generated by the last similar semantic enhancement network, and outputting the fused feature map.

Further, the specific process of generating the joint probability density matrix is as follows: is defined as P ∈ R^C*NWherein R is^C*NThe vector space is a C x N dimensional real vector space, and the mathematical expression of the vector space is as follows:

P＝σ(||A^kRψ^k)

wherein psi^k∈R^N*NA learnable parameter matrix that is the kth graph structure; sigma represents a sigmoid function; element P in matrix P_ijFeature graph F representing the ith dimension of the feature graph_iIn relation to the article with label number jProbability of a body.

Further, the specific processing process of the full convolution network layer fusing the prior information of the object on the characteristic diagram and the joint probability density matrix is as follows:

wherein

For the fine segmentation result, C is the number of feature map channels, p_jiIs the element in the jth row and ith column in the joint probability density matrix P, w_ijFor the element in row i and column j in the learnable convolution kernel W, f_jIs the ith channel of the feature map, b_iThe bias term is learnable for the ith.

Further, the quasi-semantic enhancement network respectively calculates the losses of the quasi-semantic enhancement module and the full convolution network layer fusing the prior information of the object by using a cross entropy function, the final loss is obtained by weighting and summing the two losses, and the proportion of the two losses in the final loss is controlled by an exponential decay factor.

Further, the process of restoring the thinned segmentation result to the original resolution of the input image is realized by bilinear interpolation.

In a second aspect, the present invention also provides an apparatus comprising:

one or more processors;

a memory for storing one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method for semantic segmentation of graph structure based neural networks.

In a third aspect, the present invention also provides a computer-readable storage medium for storing one or more computer programs, the one or more computer programs comprising program code for performing the above-mentioned method for semantic segmentation of a graph structure based neural network, when the computer program runs on a computer.

The invention has the beneficial effects that:

(1) the semantic enhancement module models the relevance between a channel and an object and the relevance between the channel and the channel, and learns the category semantic information which considers the intra-class consistency and the inter-class differentiability through the organic combination of the two relevance.

(2) The full convolution network layer fusing the object prior information is provided, the object prior information is acted on convolution kernel parameters by the full convolution network layer, and the operation can effectively improve the accuracy of the segmentation result.

(3) And a rough-to-fine segmentation strategy is used, and the CSENet layer and the CPFC layer are sequentially stacked, so that the rough segmentation result is gradually optimized, and the fine segmentation result is finally output as the final network output. Experiments show that the method can effectively improve the performance of the existing semantic segmentation network.

(4) The proposed network parameters are few, and the method can be applied to most of segmentation methods based on the deep neural network, so that the remarkable performance improvement is obtained.

Drawings

Fig. 1 is an overall framework structure of the deep neural network of the present invention.

FIG. 2 is a diagram of a class semantic enhancement module (CSE) according to the present invention.

Fig. 3 is a schematic diagram of a full convolution network layer (CPFC) for fusing object prior information according to the present invention.

Fig. 4 is an example of an image to be segmented input by the present invention.

Fig. 5 is an example of a coarse segmentation result generated by the present invention.

FIG. 6 is an example of the final output of the present invention.

FIG. 7 is a block diagram of a semantic segmentation apparatus based on a graph-structured neural network according to the present invention.

Detailed Description

The following describes embodiments of the present invention in further detail with reference to the accompanying drawings.

The invention provides a graph structure-based semantic segmentation task-oriented neural network, which specifically comprises the following steps:

1. problem description and variable definition

For the semantic segmentation problem with a total number of classes N, the goal is to assign the correct class label to each pixel of the input image. The existing standard deep neural network-based method is as follows: for a given input image I ∈ R^3*H*WWhere H and W are the image height and width values, respectively. First, a picture I is input into a backbone network (e.g., a residual network) to generate a feature map F ∈ R^C*h*wWherein C is the number of characteristic image channels, and h and w are height and width values after down sampling. Then, a convolution kernel ω ∈ R is used^N ^*C*1*1Performing convolution operation with the characteristic diagram to generate X₀∈R^N*h*w. Finally, the interpolation function and argmax function act on X respectively₀Outputting the segmentation result Y₀∈R^H*W。

Taking the picture to be segmented as an example in fig. 4, the picture includes "targets" of multiple categories, for example: pedestrians, roads, signs, etc. The goal of the semantic segmentation task is to assign the correct label to each type of "target" in the image. The present invention aims to further process the inaccurate segmentation result (i.e., the "coarse segmentation" result) to obtain a more accurate segmentation result (i.e., the "fine segmentation" result). Fig. 5 is an example of a rough segmentation result, and it can be observed that the neural network is not accurate in segmenting the sidewalk in the dashed frame at the lower left of the picture, and the error classification phenomenon exists for the direction board in the solid frame at the upper middle part of the picture. Fig. 6 shows the "fine segmentation" result outputted by the present invention, and it can be observed that in the segmentation result of the present invention, the left-bottom side of the "sidewalk" is more accurately segmented, and the phenomenon of misclassification of the "sign" is eliminated, which enhances the accuracy of the neural network in the segmentation task to a certain extent.

As shown in FIG. 1, the present invention proposes a coarse-to-fine segmentation network called semantic enhancement class network (CSENet), which can be flexibly inserted after the standard segmentation model to generate more accurate results, gradually refine the feature map F and coarse segmentationResults X₀。

Specifically, as shown in fig. 2, the semantic-like enhancement network proposed by the present invention is composed of a full convolution network layer (CPFC) in which a semantic-like enhancement module (CSE) fuses object prior information, and gradually refines the segmentation result by stacking n semantic-like enhancement networks in sequence. The CSE module reconstructs a characteristic diagram by using high-order information among channels; the CPFC module combines the class-channel relationship graph with the reconstructed feature graph to obtain a fine-grained segmentation result { X_i∈R^N*HWI ═ 1, 2, 3 … n }, where X is_iIndicating the result of the division of the ith cell.

2. Feature extraction network

Most of the semantic segmentation methods based on the deep neural network input an image to be segmented into a backbone network, extract picture features, and output a feature map with high dimensionality and semantic information for subsequent processing. The invention uses a more common and advanced performance Residual network (ResNet) [ He K, Zhang X, Ren S, et al. deep Residual learning for image registration [ C ]// Proceedings of the IEEE registration on computer vision and pattern registration.2016: 770-778 as a feature extraction network (i.e., a backbone network). The residual error network belongs to a convolutional neural network and mainly comprises a convolutional layer and a nonlinear layer. The network introduces a residual error structure in a classical convolutional neural network, solves the problems of gradient disappearance, gradient explosion, network degradation and the like to a certain extent, and improves the performance of the traditional convolutional neural network. The residual structure can be described by the following mathematical expression:

H(X)＝F(X)+X

wherein, X is the output of an upper network; f represents a mapping function of the layer network to the input; h is the final output map for that layer network.

From the above formula, each layer of output of the residual network is the sum of the output after being mapped by the network parameters and the original input, i.e. the network parameters only need to fit the "residual" between the "input" and the "ideal output".

3. Description of the overall construction

The invention firstly introduces the whole structure of the proposed semantic enhancement network and explains the relation between the semantic enhancement network and other modules such as a backbone network. As shown in fig. 1, the overall flow of the graph structure-based semantic segmentation-oriented neural network proposed by the present invention can be described by the following mathematical formula:

F₀＝Backbone(I)

X₀＝FC(F₀)

F_i，X_i＝CSENetⁱ(F_i-1，X_i-1)i＝1，2，3…n

Y_n＝Argmax(Bilinear(X_n))

wherein, I is an input image to be segmented; backhaul represents the above Backbone network; f₀Outputting a characteristic diagram for the backbone network; FC is used for FCN network and generates segmentation result X₀A convolutional network layer of (a); CSENetⁱEnhancing the network for the ith class of semantics; f_iAnd X_iFeature maps and segmentation results output for the ith CSENet; x_nThe segmentation result is output after the nth iteration; biliner represents Bilinear interpolation operation; argmax is the maximum value operation on the category dimension; y is_nAnd outputting the final segmentation result for the network.

From the above formula, the proposed semantic information-like enhancement network takes the feature map and the rough segmentation result as input, and outputs the feature map with more clear semantic information-like and the more refined segmentation result. Specifically, the network first obtains the primitive feature graph F using the convolution layer in the backbone network and the FCN network₀And preliminary segmentation result X₀. Then the two are input into a time sequence network consisting of n CSENets to output a segmentation result X after n iterations_n. Finally, the result is carried out bilinear interpolation to restore the original resolution of the input image, the category with the highest confidence level is taken as the final category of each pixel, and the final category is output and recorded as Y_n。

The quasi-semantic enhancement network provided by the invention consists of two modules which are respectively a quasi-semantic enhancement module and a full convolution network layer fusing object prior information. The input and output of the two modules can be described by the following formula:

F_i，P_i＝CSEⁱ(F_i-1，X_i-1)

X_i＝CPFCⁱ(F_i，P_i)

wherein, CSEⁱAnd CPFCⁱThe i-th semantic enhancement module and the full convolution network layer fusing the prior information of the object are respectively.

As can be seen from the above formula, the CSE module takes the feature map and the rough segmentation result generated in the previous iteration (i.e. the i-1 st iteration) as input, and outputs a new feature map F_iWhile outputting a joint probability density matrix P_i(ii) a CPFC layer with F_iPx is used as input, and the segmentation result X after the ith round of iteration is output_i。

4. CSE module

(1) Computing correlations between categories and feature maps

The invention defines the correlation between classes and feature graph channels as follows:

wherein, F_iRepresenting a feature map in the ith dimension, X_jRepresents the jth rough segmentation graph; t represents a transpose operation;

representing the summation of each element in the jth rough segmentation map; r_i，jIs shown as F_iAverage response value of the above jth class. Higher R_i，jIndicating that the ith dimension and the jth class are semantically more related and vice versa.

(2) Learning a joint probability density matrix P

First, the invention builds K independent graph structures to represent the interdependence between different dimensions of features. Their adjacency matrices are defined as follows:

wherein

The inner product operation is represented by the following operation,

is a learnable projection function, R_iA vector for representing the strength of the relationship between the ith dimension of the characteristic diagram and the objects corresponding to the label numbers of various types,

is the edge weight between the ith and jth feature dimensions. Is higher

Is represented by F_iAnd F_jMore likely to contain the same class.

Secondly, the standard multi-head graph model is used for interacting and aggregating the class semantics on the mutually dependent dimensionality, and finally a joint probability density matrix is generated and defined as P belonging to R^C*NThe mathematical expression is as follows:

P＝σ(IIA^kRψ^k)

wherein A is^kFor the k-th adjacency matrix between feature map channels, R is the channel-object relationship matrix, psi^k∈R^N*NA learnable parameter matrix that is the kth graph structure; sigma represents a sigmoid function; ith, j element P in matrix P_ijIs represented by F_iRelated to the probability of class j.

(3) Feature reconstruction

Adjusting R using a joint probability density matrix P to further reconstruct a feature map

Subsequently, the residuals are subtracted using a convolution kernel with dimension 2C 1Network generated feature map F and

and fusing and outputting a new feature diagram F', wherein |, indicates a corresponding element multiplication operation.

5. CPFC layer

As shown in fig. 3, the CPFC layer uses the joint probability density matrix P output by the CSE module, the reconstructed new feature map F 'as input, and uses the joint probability density matrix P to multiply the learnable convolution kernel element by element, and then performs convolution operation with the feature map F', and outputs a fine segmentation result. The mathematical expression of the process is as follows:

wherein p is_jiIs the element in the jth row and ith column in the joint probability density matrix P, w_ijFor the element in row i and column j in the learnable convolution kernel W, f_jIs the ith channel of feature map F, b_iThe bias term is learnable for the ith.

6. Loss function

In the design of the loss function, the invention uses the cross entropy function to respectively calculate the loss L of the rough segmentation and the fine segmentation_cAnd L_f. The mathematical expression of the cross entropy function is:

where x is the network input, p (x) is the desired output, q (x) is the actual output of the network, and H (p, q) is the cross entropy.

The final loss L can be illustrated by the following equation:

L＝(a+γ)L_c+(b-γ+0.1)L_f

wherein L is_c，L_fCoarse and fine segmentation losses calculated for the above-mentioned use of cross entropy losses; l is a loss function of the final application of the network; a and b are constants, a is 0.4, and b is 0.9; iter is the current iteration number of the semantic enhancement network (namely the current semantic enhancement network); gamma is an exponential decay factor whose value is less than a threshold iteration number iter_γThe time is exponentially attenuated from b to 0, and the design enables the network to pay more attention to the rough segmentation result in the initial training stage and pay more attention to the fine segmentation result in the later stage of the network.

7. Outputting the segmentation result

Obtaining X after n iterations_nThen, after bilinear interpolation and maximum value operation on class dimension are carried out on the segmentation result, the final segmentation result Y is obtained_n。

Given point (x)₁，y₁)，(x₁，y₂)，(x₂，y₁)，(x₂，y₂) And their values q₁₁，q₁₂，q₂₁，q₂₂The value q of the point to be found (x, y) is given by:

according to this rule, the invention works on X with dimensions N × hw_nThe matrix with dimension N × HW is obtained by upsampling, where H and W are the image height value and width value, respectively, N is the total number of classes, and l represents the l-th class. Finally, the N x 1 dimensional vector corresponding to the pixel of the ith row and the jth column of the output after the nth iteration

Take its maximum value

As the final class of this pixel, namely:

obtained Y_nThe dimension is H x W, and the dimension is H x W,

the value of (d) represents the class label of the pixel in the ith row and the jth column.

The embodiment of the invention facing the automatic driving task is as follows:

(1) preparation work

Firstly, a data set required by an experiment needs to be prepared as Cityscapes, the data set has 5000 images of driving scenes in an urban environment, the data set is correctly divided, namely pedestrians, roads, signs and the like are successfully distinguished, and the data set can be effectively applied to an automatic driving task.

Next, the backbone network pre-training parameters are loaded (download link is https:// download. catalog. org/models/respet 50-19c 8. 357. pth).

(2) Setting the hyper-parameters, which mainly comprises the following hyper-parameters:

name of hyper-parameter	Initial learning rate	Void fraction	a	b	epoch	Batch size	n
								Value of	0.009	8	0.4	0.9	120	16	3

(3) And selecting any data set to train the network, and testing the accuracy of the network after the training is finished. In experiments where the backbone network was ResNet-50 and the dataset was ctyscapes, the original FCN network accuracy was 72.25% and after using the CSENet proposed by the present invention, the accuracy on the ctyscapes dataset was 75.18%.

Corresponding to the embodiment of the semantic segmentation method based on the graph structure neural network, the invention also provides an embodiment of a semantic segmentation device based on the graph structure neural network.

Referring to fig. 7, an embodiment of the present invention provides a semantic segmentation apparatus based on a graph structure neural network, which includes a memory and one or more processors, where the memory stores executable codes, and when the processors execute the executable codes, the semantic segmentation apparatus is configured to implement the semantic segmentation method based on a graph structure neural network in the foregoing embodiment.

The semantic segmentation apparatus based on graph structure neural network of the present invention can be applied to any device with data processing capability, such as a computer or other devices or apparatuses. The device embodiments may be implemented by software, or by hardware, or by a combination of hardware and software. The software implementation is taken as an example, and as a logical device, the device is formed by reading corresponding computer program instructions in the nonvolatile memory into the memory for running through the processor of any device with data processing capability. From a hardware aspect, as shown in fig. 7, the present invention is a hardware structure diagram of any device with data processing capability in which a semantic segmentation apparatus based on a graph structure neural network is located, except for the processor, the memory, the network interface, and the nonvolatile memory shown in fig. 7, in an embodiment, any device with data processing capability in which the apparatus is located may also include other hardware according to an actual function of the any device with data processing capability, which is not described again.

The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.

For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.

An embodiment of the present invention further provides a computer-readable storage medium, on which a program is stored, where the program, when executed by a processor, implements the graph structure neural network-based semantic segmentation method in the foregoing embodiments.

The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any data processing capability device described in any of the foregoing embodiments. The computer readable storage medium may also be any external storage device of a device with data processing capabilities, such as a plug-in hard disk, a Smart Media Card (SMC), an SD Card, a Flash memory Card (Flash Card), etc. provided on the device. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of any data processing capable device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the arbitrary data processing-capable device, and may also be used for temporarily storing data that has been output or is to be output.

The above-described embodiments are intended to illustrate rather than to limit the invention, and any modifications and variations of the present invention are within the spirit of the invention and the scope of the appended claims.

Claims

1. A semantic segmentation method based on a graph structure neural network is characterized by comprising the following steps:

2. The method according to claim 1, wherein the class is a class of the graph-structured neural networkA similar semantic enhancement module in the semantic enhancement network converts the characteristic diagram and the segmentation result of the previous semantic enhancement network into matrixes and multiplies the matrixes to obtain an object-channel relation matrix, the relation matrixes are linearly mapped by using k learnable parameters, cosine similarity is calculated in pairs for each mapped vector to form k adjacent matrixes A^k(k＝1，2，3)，A^kElement A in (A)_i，jThe value of (a) can represent the correlation degree between a characteristic diagram channel i and a characteristic diagram channel j, and the larger the value of (b), the more compact the correlation of the two channels; interacting and aggregating semantic information on the dimension with close association by using a graph neural network to generate a joint probability density matrix; and multiplying the relation matrix and the joint probability density matrix element by element, then multiplying the multiplied relation matrix and the joint probability density matrix by a matrix converted by the segmentation result to obtain a reconstructed feature map, fusing the reconstructed feature map with the feature map generated by the last similar semantic enhancement network, and outputting the fused feature map.

3. The method for semantic segmentation based on the graph structure neural network according to claim 2, wherein the specific process of generating the joint probability density matrix is as follows: is defined as P ∈ R^C*NWherein R is^C*NThe vector space is a C x N dimensional real vector space, and the mathematical expression of the vector space is as follows:

P＝σ(||A^kRψ^k)

wherein psi^k∈R^N*NA learnable parameter matrix that is the kth graph structure; sigma represents a sigmoid function; element P in matrix P_ijFeature graph F representing the ith dimension of the feature graph_iRefers to the probability of the object with label number j.

4. The semantic segmentation method based on the graph structure neural network as claimed in claim 1, wherein the full convolution network layer fusing the prior information of the object specifically processes the feature map and the joint probability density matrix as follows:

wherein

5. The semantic segmentation method based on the graph structure neural network according to claim 1, characterized in that the semantic-like enhancement network uses a cross entropy function to calculate losses of the semantic-like enhancement module and a full convolution network layer fusing prior information of an object, respectively, the final loss is obtained by weighted summation of the two losses, and the proportion of the two losses in the final loss is controlled by an exponential decay factor.

6. The semantic segmentation method based on the graph structure neural network as claimed in claim 1, wherein the process of restoring the refined segmentation result to the original resolution of the input image is realized by bilinear interpolation.

7. An apparatus, characterized in that the apparatus comprises:

one or more processors;

a memory for storing one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method recited in any of claims 1-6.

8. A computer-readable storage medium storing one or more computer programs, the one or more computer programs comprising program code for performing the method for graph structure neural network based semantic segmentation of a neural network of any one of claims 1-6 above, when the computer program runs on a computer.