CN113902753A - Image semantic segmentation method and system based on dual-channel and self-attention mechanism - Google Patents

Image semantic segmentation method and system based on dual-channel and self-attention mechanism

Info

Publication number
CN113902753A
CN113902753A
Authority
CN
China
Prior art keywords
feature map
pixel
channel
matrix
context
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111129122.7A
Other languages
Chinese (zh)
Inventor
李天平
魏艳军
严业金
丁同贺
欧佳瑜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Normal University
Original Assignee
Shandong Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Normal University filed Critical Shandong Normal University
Priority to CN202111129122.7A
Publication of CN113902753A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/11 Region-based segmentation
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10024 Color image
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/20212 Image combination
    • G06T 2207/20221 Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image semantic segmentation method and system based on dual channels and a self-attention mechanism, wherein the method comprises the following steps: acquiring a picture to be segmented; extracting feature maps of two channels from the picture to be segmented, the first channel extracting a multi-scale context information feature map and the second channel extracting a pixel-level feature map; obtaining, from the multi-scale context information feature map and the pixel-level feature map through matrix operation and self-attention learning, a feature map relating each pixel to its corresponding context region; and inputting the feature map relating each pixel to its corresponding context region into the trained classifier, and outputting the semantic segmentation result of the picture.

Description

Image semantic segmentation method and system based on dual-channel and self-attention mechanism
Technical Field
The invention belongs to the technical field of computer vision and image processing, and particularly relates to an image semantic segmentation method and system based on a dual-channel and self-attention mechanism.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Semantic segmentation not only falls within the research domain of computer science, where the semantic segmentation of images has important application significance for image restoration, but also touches frontier, multi-disciplinary fields such as medical imaging, automatic driving and satellite remote sensing; it therefore has very important research significance and application value.
At present, semantic segmentation generally adopts traditional convolutional neural networks such as VGG and ResNet. However, the structures of these traditional networks are complex and their computational cost is large, leaving a gap between the real-time performance of semantic segmentation and its practical deployment.
A traditional network continuously extracts features from the input picture; through multiple layers of convolution and down-sampling, the resolution of the feature map is greatly reduced and original information is lost. More and more network designs therefore try to reduce this loss of resolution, for example by using an ASPP module composed of dilated (atrous) convolutions to extract semantic information over multi-scale receptive fields, with the encoder integrating dilated convolution or replacing pooling with stride-2 convolution to ease the contradiction between stride, receptive field and feature-map information. Although this improves results, the computational cost is large, dilated convolution cannot produce dense context information, and overusing it produces a gridding effect, so the problem of semantic segmentation is not fundamentally solved. Since semantic segmentation is a pixel-level segmentation task, the relationships among convolution, stride, resolution, receptive field and context must be considered jointly. Clearly, using the ASPP module alone produces the gridding effect and loses the overall context information, and is not a good solution for a dense segmentation task.
PSPNet adds a pyramid pooling module behind the extracted features, which solves the problem of global information loss and outperforms DeepLab, but it lacks the relationship between pixels and context. The two methods share a common defect: because they use traditional networks, the parameter count is too large, and extracting features deep in the network inevitably loses resolution. Introducing a network that keeps feature extraction at high resolution with a small computational cost thus becomes necessary for improving both the precision and the speed of semantic segmentation.
In addition, semantic segmentation is a pixel-level segmentation task, yet conventional networks consider either only the relationships between pixels, as in the DeepLab series, or only the information between regions, as in the pyramid pooling module proposed by PSPNet.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides an image semantic segmentation method and system based on a dual-channel and self-attention mechanism. The balance between complexity and precision is considered comprehensively: an HRNetV2 network is used as the backbone to extract features of the input picture, and a self-attention mechanism combined with the pyramid pooling module of PSPNet learns the relationships between pixels and regions on the feature maps extracted by the HRNetV2 network, further improving precision.
In order to achieve the above object, one or more embodiments of the present invention provide the following technical solutions:
the image semantic segmentation method based on the dual-channel and self-attention mechanism comprises the following steps:
acquiring a picture to be segmented;
extracting feature maps of two channels from the picture to be segmented: the first channel extracts a multi-scale context information feature map, and the second channel extracts a pixel-level feature map;
obtaining, from the multi-scale context information feature map and the pixel-level feature map through matrix operation and self-attention learning, a feature map relating each pixel to its corresponding context region;
and inputting the feature map relating each pixel to its corresponding context region into the trained classifier, and outputting the semantic segmentation result of the picture.
Further, extracting the multi-scale context information feature map in the first channel specifically includes: inputting the picture to be segmented into an HRNetV2_w18 network to obtain a first feature map of the first channel, convolving the first feature map of the first channel to obtain a second feature map of the first channel, and extracting multi-scale context information from the second feature map of the first channel through a pyramid pooling module to obtain a third feature map of the first channel.
Further, after the third feature map of the first channel is obtained, random inactivation (dropout) of neurons is applied, and after convolution a cross-entropy loss function is added for auxiliary training of the network, yielding a fourth feature map of the first channel, namely the multi-scale context information feature map.
Further, extracting the pixel-level feature map in the second channel specifically includes: inputting the picture to be segmented into an HRNetV2_w18 network to obtain a first feature map of the second channel, and extracting a second feature map of the second channel, namely the pixel-level feature map, from the first feature map of the second channel through convolution, batch normalization (BatchNorm) and a ReLU function.
Further, the matrix operation between the multi-scale context information feature map and the pixel-level feature map specifically includes matrix-multiplying the multi-scale region context information feature map with the pixels at each position of the pixel-level feature map, comprising the following steps:
preprocessing the multi-scale region context information feature map and outputting a first matrix;
normalizing the pixel-level feature map and outputting a second matrix; multiplying the first matrix by the second matrix and outputting a third matrix feature map, namely the fused pixel-level region context feature map.
Further, the self-attention learning comprises learning the relationship between the pixel at each position and that pixel's context region, and generating a corresponding feature map according to this relationship;
specifically: the fused pixel-level region context feature map and the pixel-level feature map are respectively input into the attention mechanism to obtain three matrices Q, K and V;
the Q matrix, generated from the pixel-level feature map, is used to query relationships against the K matrix; the K matrix, generated from the fused pixel-level region context feature map, provides the keys that the Q matrix queries; the V matrix, generated from the fused pixel-level region context feature map, carries the actual information and feature attributes.
Further, obtaining the feature map relating each pixel to its corresponding context region specifically includes:
the pixel-level feature map queries its relationship to the fused pixel-level region context feature map through the Q matrix, the fused pixel-level region context feature map answers the query through the K matrix, and the query result is converted into a probability weight matrix of the correspondence between the pixel-level feature map and the fused pixel-level region context feature map; the V matrix is reconstructed according to the probability weight matrix, assigning probability weight parameters to pixel regions of the same category and to pixel regions of different categories.
One or more embodiments provide an image semantic segmentation system based on dual channels and a self-attention mechanism, comprising:
an image acquisition module configured to: acquire a picture to be segmented;
a dual-channel feature map extraction module configured to: extract feature maps of two channels from the picture to be segmented, the first channel extracting a multi-scale context information feature map and the second channel extracting a pixel-level feature map;
a feature fusion module configured to: obtain, from the multi-scale context information feature map and the pixel-level feature map through matrix operation and self-attention learning, a feature map relating each pixel to its corresponding context region;
a semantic segmentation module configured to: input the feature map relating each pixel to its corresponding context region into the trained classifier, and output the semantic segmentation result of the picture.
One or more embodiments provide a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing, when executing the program, the steps of any of the image semantic segmentation methods based on the dual-channel and self-attention mechanism described above.
One or more embodiments provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of any of the image semantic segmentation methods based on the dual-channel and self-attention mechanism described above.
The above one or more technical solutions have the following beneficial effects:
(1) The method comprehensively considers the balance between network complexity and precision: the HRNetV2 network is used as the backbone to extract features of the input picture, and the self-attention mechanism combined with the pyramid pooling module of PSPNet learns pixel-to-region relationships on the features extracted by the HRNetV2 network, thereby improving precision.
(2) The semantic segmentation technology and system provided by the invention can optimize network performance during training and strengthen the relationship between pixels and context information, so that the network attends only to the regions and to the pixel-to-region relationships rather than segmenting all regions of the whole feature map together; this improves segmentation precision without causing erroneous segmentation. HRNetV2 is used as the backbone instead of ResNet mainly because it is a lightweight feature extraction structure with low computational requirements, excellent performance and fast model training. Overall, the network system model has a simple structure and a small computational cost, while precision is greatly improved.
(3) To prevent network overfitting, random inactivation (dropout) is added and a CrossEntropyLoss function is added for auxiliary training of the network; the resulting feature map carries context semantic information. Feature maps are obtained through multi-scale average pooling and fused, and the fused feature map carries hierarchical and diverse global context semantic information. The auxiliary training accelerates the convergence of the system model and extracts richer regions of context semantic information.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and together with the description serve to explain the invention, without limiting it.
FIG. 1 is a flowchart of an image semantic segmentation method based on dual channels and a self-attention mechanism according to an embodiment of the present invention;
FIG. 2 is a structural diagram of an image semantic segmentation method based on dual channels and a self-attention mechanism in an embodiment of the present invention;
FIG. 3 is an overall framework diagram of a pyramid scene analysis network in an embodiment of the present invention;
FIG. 4 is a diagram of a self-attention mechanism in an embodiment of the present invention;
FIGS. 5(a)-5(b) are graphs of the loss value and the learning-rate decay over the last 24 hours of the training process in the embodiment of the present invention;
FIGS. 6(a)-6(c) are original pictures of segmentation scenes in the embodiment of the present invention;
FIGS. 7(a)-7(c) are segmentation map labels for different scenes in the embodiment of the present invention;
FIGS. 8(a)-8(c) are graphs of the segmentation effect under different scenes in the embodiment of the present invention.
Detailed Description
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise; and it should be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Example one
As shown in fig. 1, the present embodiment discloses an image semantic segmentation method based on dual channels and a self-attention mechanism, which includes the following steps:
s101, acquiring a picture to be segmented;
s102, respectively extracting feature maps of two channels from the picture to be segmented;
extracting a multi-scale context information feature map by a first channel; extracting a pixel-level feature map in a second channel;
and S103, obtaining a robustness characteristic diagram of each pixel in relation to the corresponding context region by matrix operation and self-attention mechanism learning of the multi-scale region context information characteristic diagram and the pixel level characteristic diagram.
And S104, inputting the feature map of each pixel and the context region corresponding to the pixel into the trained classifier, and outputting the semantic segmentation result of the image.
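These four steps can be pictured with the following high-level PyTorch sketch; the class name TwoChannelSegmenter and the sub-module names are illustrative assumptions, not names used by the patented implementation:

    import torch.nn as nn

    class TwoChannelSegmenter(nn.Module):
        def __init__(self, backbone, context_branch, pixel_branch, attention, classifier):
            super().__init__()
            self.backbone = backbone              # shared HRNetV2 feature extractor
            self.context_branch = context_branch  # first channel: multi-scale context
            self.pixel_branch = pixel_branch      # second channel: pixel-level map
            self.attention = attention            # matrix operation + self-attention (S103)
            self.classifier = classifier          # trained classifier head (S104)

        def forward(self, image):                 # S101: picture to be segmented
            feats = self.backbone(image)          # S102: shared features
            ctx = self.context_branch(feats)      # multi-scale context information feature map
            pix = self.pixel_branch(feats)        # pixel-level feature map
            rel = self.attention(pix, ctx)        # pixel-to-context-region relations
            return self.classifier(rel)           # semantic segmentation result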
As shown in fig. 2-3, as one or more embodiments, in S102, extracting the multi-scale context information feature map in the first channel specifically includes:
inputting the picture to be segmented into the HRNetV2_w18 network to obtain a first feature map of the first channel, F_1^1; convolving F_1^1 with a 1×1 conv to obtain a second feature map of the first channel, F_1^2; and performing feature extraction on F_1^2 through a pyramid pooling module (PPM) to obtain a third feature map of the first channel, F_1^3, which carries the multi-scale context information. Throughout, a feature map is denoted F_x^y, where x indicates which channel it belongs to and y indicates which feature map of that channel it is.
In order to prevent overfitting of the network, random inactivation (dropout) is applied to the third feature map of the first channel, randomly inactivating neurons with probability 0.1, and after a 1×1 convolution a cross-entropy loss function (CrossEntropyLoss) is added for auxiliary training of the network, yielding the final fourth feature map of the first channel, F_1^4, namely the multi-scale context information feature map.
The CrossEntropyLoss function is calculated as follows:

loss(input, class) = -log( exp(input[class]) / Σ_j exp(input[j]) ) = -input[class] + log( Σ_j exp(input[j]) )    (1)

wherein class denotes the label of a given category, input denotes the feature vector of the input feature map, and input[j] denotes the score of the j-th category.
The advantage of this scheme is that the CrossEntropyLoss function added after random inactivation (dropout) accelerates the convergence of the system model and extracts richer regions.
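For illustration, a minimal PyTorch sketch of this auxiliary branch follows; the 512-channel input width is an assumption, and the 19-class output matches the Cityscapes example given later:

    import torch.nn as nn

    aux_head = nn.Sequential(
        nn.Dropout2d(p=0.1),                # random inactivation of neurons, probability 0.1
        nn.Conv2d(512, 19, kernel_size=1),  # 1x1 convolution to per-class scores
    )
    aux_criterion = nn.CrossEntropyLoss()   # the loss of formula (1), used for auxiliary training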
Relative to ResNet, HRNetV2_w18 is a lightweight backbone; it maintains the resolution of the original image during feature extraction, so the feature-extraction process is one of maintaining high resolution.
Traditional feature extraction sacrifices resolution to extract features, the networks are deep, and the demand on computing power is high. Considering both computing power and accuracy, this embodiment selects HRNetV2_w18 as the backbone.
As one or more embodiments, the pyramid pooling module (PPM) performs pooling at different output sizes to extract features from the feature map. The pooling is denoted AdaptiveAvgPool2D(I_x), where AdaptiveAvgPool2D is a pooling function and I_x specifies the height and width of the pooled output feature map: I_x = [1,1], [2,2], [3,3] and [6,6], with 67 channels.
Rich context regions are thereby extracted through pooling at different scales.
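A minimal PyTorch sketch of such a pyramid pooling module is given below; only the bin sizes [1,1], [2,2], [3,3], [6,6] and the 67-channel width are taken from the description, while the per-branch 1x1 convolutions and the bilinear upsampling are assumptions in the spirit of PSPNet:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PyramidPooling(nn.Module):
        def __init__(self, in_channels=67, bin_sizes=(1, 2, 3, 6)):
            super().__init__()
            # one AdaptiveAvgPool2d branch per output size I_x
            self.pools = nn.ModuleList(nn.AdaptiveAvgPool2d(s) for s in bin_sizes)
            self.convs = nn.ModuleList(
                nn.Conv2d(in_channels, in_channels, kernel_size=1) for _ in bin_sizes
            )

        def forward(self, x):
            h, w = x.shape[2:]
            feats = [x]
            for pool, conv in zip(self.pools, self.convs):
                y = conv(pool(x))  # pool to I_x, then 1x1 conv
                feats.append(F.interpolate(y, size=(h, w), mode='bilinear',
                                           align_corners=False))
            return torch.cat(feats, dim=1)  # concat original map with all scales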
In S102, extracting the pixel-level feature map in the second channel specifically includes:
inputting the picture to be segmented into the HRNetV2_w18 network to obtain a first feature map of the second channel, F_2^1, and extracting, after a 1×1 conv convolution, batch normalization (BatchNorm) and a ReLU function, a second feature map of the second channel, F_2^2, i.e., the pixel-level feature map.
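For illustration, the pixel head of the second channel might look as follows in PyTorch; the 270-channel input (the summed HRNetV2_w18 branch widths 18+36+72+144) and the 256-channel output are assumptions:

    import torch.nn as nn

    pixel_head = nn.Sequential(
        nn.Conv2d(270, 256, kernel_size=1),  # 1x1 conv
        nn.BatchNorm2d(256),                 # batch normalization (BatchNorm)
        nn.ReLU(inplace=True),               # ReLU function
    )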
As one or more embodiments, the matrix operation between the multi-scale region context information feature map and the pixel-level feature map specifically includes:
matrix-multiplying the multi-scale region context information feature map with the pixels at each position of the pixel-level feature map; preprocessing the multi-scale region context information feature map and outputting a first matrix; normalizing the pixel-level feature map and outputting a second matrix; and multiplying the first matrix by the second matrix and outputting a third matrix feature map, namely the fused pixel-level region context feature map.
The specific implementation is as follows:
the fourth feature map of the first channel, F_1^4, passes through reshape_1 to output [n, c, h×w], and then through transpose to output [n, h×w, c], named U, where n denotes the number of pictures input at one time, c the number of channels of the feature map, h the height of the feature map and w its width.
The second feature map of the second channel, F_2^2, passes through reshape_1 and Softmax normalization to output [n, k, h×w], named I.
The two matrices I and U are multiplied, and the product is finally output through the transpose and unsqueeze functions as [n, c, k, 1]; the resulting feature map is named F^f, the fused pixel-level region context feature map. That is, F^f is obtained by multiplying the feature map F_1^4 after multi-region feature extraction with the pixels at each position of the feature map F_2^2, so as to strengthen the relationship between pixels and region features; in the feature map obtained from this relationship, the distinction between regions is more obvious and the context information is prominent.
In this way, the category information among regions is more refined, and the pixels within each region of the feature map are enhanced and aggregated, yielding a richer feature map.
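A minimal PyTorch sketch of this fusion step, following the shape conventions just described (the function name fuse is ours):

    import torch
    import torch.nn.functional as F

    def fuse(context_map, pixel_map):
        # context_map: F_1^4, shape [n, c, h, w]; pixel_map: F_2^2, shape [n, k, h, w]
        n, c, h, w = context_map.shape
        k = pixel_map.shape[1]
        U = context_map.reshape(n, c, h * w).transpose(1, 2)   # reshape_1 + transpose: [n, h*w, c]
        I = F.softmax(pixel_map.reshape(n, k, h * w), dim=-1)  # reshape_1 + Softmax: [n, k, h*w]
        fused = torch.bmm(I, U)                                # multiply I and U: [n, k, c]
        return fused.transpose(1, 2).unsqueeze(-1)             # transpose + unsqueeze: [n, c, k, 1]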
As one or more embodiments, as shown in fig. 4, the process of self-attention learning includes: inputting the result of the matrix operation between the multi-scale context information feature map and the pixel-level feature map into a self-attention mechanism, and learning the relationship between the pixel at each position and its region.
The self-attention mechanism learns the relationship between the pixel at each position and that pixel's context region, and generates a corresponding feature map according to this relationship, specifically:
F^f is input into the self-attention mechanism, a feature map is obtained using a 1×1 conv with 256 convolution kernels, and this feature map is then flattened along the width and height dimensions to obtain the Key matrix. Similarly, the pixel-level feature map is input into the attention mechanism and passed through a 1×1 conv with 256 convolution kernels, and the resulting feature map is flattened along the height and width dimensions to obtain the Query matrix. Meanwhile, F^f is passed through a 1×1 conv with 256 kernels and flattened along the width and height dimensions to obtain the Value matrix. The Query, Key and Value matrices are abbreviated Q, K and V. The Q matrix, generated from the pixel-level feature map, is used to query relationships against the K matrix; the K matrix, generated from the fused pixel-level region context feature map, provides the keys that the Q matrix queries; the V matrix, generated from the fused pixel-level region context feature map, carries the actual information and feature attributes.
This embodiment seeks, for each pixel of F_2^2, the relation matrix between that pixel and the corresponding region of F^f, which comprises the following steps:
the pixel-level feature map queries its relationship to the fused pixel-level region context feature map through the Q matrix, the fused pixel-level region context feature map answers the query through the K matrix, and the query result is converted into a probability weight matrix of the correspondence between the pixel-level feature map and the fused pixel-level region context feature map; the V matrix is then reconstructed according to the probability weight matrix, assigning probability weight parameters to pixel regions of the same category and to pixel regions of different categories.
For example:
Step 1: F_2^2 takes its own query matrix Q to query F^f, with F^f providing its own K matrix; the query is realized by QK^T. For pixels belonging to the same class of labels, the angle between their feature vectors is close to 0 (cosine close to 1), so the values of their inner products are very large, whereas the inner products between pixels belonging to different classes of labels are very small.
Step 2: QK^T is passed through the normalized exponential function softmax, transforming the relationship between F_2^2 and F^f into the corresponding relational probability weight matrix SIM.
Step 3: the V matrix is reconstructed, giving pixel regions of the same category higher probability weight parameters and pixel regions of different categories low probability weight parameters; this is realized by the formula softmax(QK^T)V, and the reconstructed feature map is named the self-attention feature map.
At this point, the reconstructed feature map no longer carries the features and information of the original feature maps themselves; instead, it carries the features and information of the relationship between each pixel of F_2^2 and the corresponding context region of F^f.
The specific formulas are implemented as follows:

Q = tran(reshape_1(c(F_2^2)))    (2)
K = reshape_1(c(F^f))    (3)
V = tran(reshape_1(c(F^f)))    (4)
SIM = Softmax(BMM(Q, K))    (5)
F^s = reshape_2(tran(BMM(V, SIM)))    (6)

wherein s_x is the set of pixel points of a single-channel feature map belonging to different feature maps, d_x denotes the different channels generated as the feature map is propagated, c denotes the 1×1 convolution, and reshape_1 denotes a transfer function that converts the three-dimensional feature map into a matrix of vectors.
As a specific example: the data set of this embodiment has 19 categories, so the corresponding number of vectors is 19. For the variable Q, the input F_2^2 is output as [n, 19, w1, h1] through a 1×1 conv, then as [n, 19, w1×h1] through reshape_1, and finally as [n, w1×h1, 19] through tran.
For the variable K, the input F^f is output as [n, 19, h2, w2] through a 1×1 conv, then as [n, 19, h2×w2] through reshape_1.
For V, the same operation as for Q is applied: the input F^f is finally output as [n, w2×h2, 19]. BMM denotes the multiplication of two matrices and Softmax denotes the normalized exponential function.
As for reshape_2 in formula (6), its input is tran(BMM(V, SIM)); with dim being 0, it converts the pixel set of dimension h1×w1 of each channel into a single-channel two-dimensional planar feature map whose dimensions are the same as the label's, so that the loss can be computed against the label.
Through these five groups of formulas, the relationship between each pixel and its context is obtained, and the relationship between each region and the corresponding pixels can likewise be calculated.
Here, F^s is a robust feature map carrying the relationship between pixels and context.
The advantage of the above scheme is that the robust feature map with pixel-context relationships finally obtained in this embodiment considers not only the relationships between regions and between a pixel and its surroundings, but also the relationships between pixels and regions, so that similar pixels are aggregated.
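For illustration, formulas (2)-(6) can be sketched in PyTorch as follows; the input channel width is an assumption, and the operand order of the final BMM is chosen so that the shapes compose, realizing the softmax(QK^T)V of Step 3:

    import torch
    import torch.nn as nn

    class PixelRegionAttention(nn.Module):
        def __init__(self, channels=256, num_classes=19):
            super().__init__()
            self.q_conv = nn.Conv2d(channels, num_classes, kernel_size=1)  # c(.) in (2)
            self.k_conv = nn.Conv2d(channels, num_classes, kernel_size=1)  # c(.) in (3)
            self.v_conv = nn.Conv2d(channels, num_classes, kernel_size=1)  # c(.) in (4)

        def forward(self, pixel_map, fused_map):
            # pixel_map: F_2^2, [n, channels, h1, w1]; fused_map: F^f
            n, _, h1, w1 = pixel_map.shape
            Q = self.q_conv(pixel_map).flatten(2).transpose(1, 2)  # (2): [n, w1*h1, 19]
            K = self.k_conv(fused_map).flatten(2)                  # (3): [n, 19, h2*w2]
            V = self.v_conv(fused_map).flatten(2).transpose(1, 2)  # (4): [n, h2*w2, 19]
            SIM = torch.softmax(torch.bmm(Q, K), dim=-1)           # (5): probability weights
            out = torch.bmm(SIM, V)                                # (6): reconstruct V, [n, w1*h1, 19]
            return out.transpose(1, 2).reshape(n, -1, h1, w1)      # reshape_2: back to a spatial map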
Finally, CONT is concatenated (concat) with the pixel-level feature map F_2^2 and input into a 1×1 convolution, and a feature map of the same size as the input picture is then output; this output is used to compute the loss against the real label, the loss function being the same as formula (1).
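A sketch of this final head, under assumed names and channel counts (cont stands for CONT and pixel_map for F_2^2):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FusionHead(nn.Module):
        def __init__(self, in_channels, num_classes=19):
            super().__init__()
            self.conv = nn.Conv2d(in_channels, num_classes, kernel_size=1)

        def forward(self, cont, pixel_map, out_size):
            x = torch.cat([cont, pixel_map], dim=1)  # concat along the channel dimension
            x = self.conv(x)                         # 1x1 convolution
            # resize to the input-picture size so the loss is taken per pixel
            return F.interpolate(x, size=out_size, mode='bilinear', align_corners=False)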
Preferably, this embodiment optimizes the system model with a primary loss function and an auxiliary loss function, the ratio of primary to auxiliary being 1 : 0.4.
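The stated 1 : 0.4 weighting can be sketched as:

    import torch.nn as nn

    criterion = nn.CrossEntropyLoss()  # same form as formula (1)

    def total_loss(main_logits, aux_logits, labels, aux_weight=0.4):
        # primary : auxiliary = 1 : 0.4, as stated above
        return criterion(main_logits, labels) + aux_weight * criterion(aux_logits, labels)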
To reflect the fairness of the experiment, comparison experiments between the PSPNet model and the present system model were carried out repeatedly on the Cityscapes data set; the selected device is a GPU Tesla V100 32GB. The training method inputs a batch size of 10 pictures, with 100 iteration steps at a time, to ensure that the loss value changes stably, as shown in figs. 5(a)-5(b).
The total number of iteration steps is set to 160000. The optimizer is SGD with learning rate 0.0025 and momentum 0.9, and two CrossEntropyLoss terms are used; the loss function is as in formula (1). The two loss functions assist the training of the system model jointly. PSPNet chooses ResNet50 as its backbone, while we choose HRNetV2_w18. Compared with the original model, the mIoU of the present network system is 1.7% higher than that of PSPNet, and the segmentation accuracy of targets is clearly improved; the larger the mIoU, the more obvious the advantage of the system and the more accurate the localization of the target segmentation bounding box. The mPA of this embodiment is 0.5% higher than that of PSPNet.
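The stated optimizer configuration corresponds to the following sketch, where model is a stand-in for the full network described above:

    import torch
    import torch.nn as nn

    model = nn.Conv2d(3, 19, kernel_size=1)  # placeholder for the dual-channel network
    optimizer = torch.optim.SGD(model.parameters(), lr=0.0025, momentum=0.9)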
This embodiment runs experiments on the Cityscapes benchmark data set; after comparison with the PSPNet algorithm, the obtained mIoU and IoU values are shown in the tables below. As can be seen from tables 1 and 2, the network performance is better.
TABLE 1 (mIoU comparison with PSPNet on Cityscapes; the table is an image in the original publication)
TABLE 2 (IoU values; the table is an image in the original publication)
FIGS. 5(a)-5(b) are graphs of the loss value and the learning-rate decay over the last 24 hours of the training process in the embodiment of the present invention; as shown in figs. 5(a)-5(b), they give a visual display of the precision of this embodiment. Scene samples in the verification set are randomly selected for test verification, to visually display the practicality of the system.
FIGS. 6(a)-6(c) are original pictures of segmentation scenes in the embodiment of the present invention; figs. 7(a)-7(c) are the corresponding segmentation map labels; figs. 8(a)-8(c) are graphs of the segmentation effect under different scenes. The method of this embodiment can correctly segment different categories, such as pedestrians, vehicles, cyclists and traffic lights, and renders different categories in different colors. Because the method uses HRNetV2_w18 as the backbone, the parameter count is small and the practicality is stronger. The method can effectively segment different complex scenes without interference from the surrounding environment, showing strong anti-interference capability, and it performs well in complex and variable data environments.
Those skilled in the art will understand that the detection contents for different types of equipment differ during the inspection process, and the detection contents can be preset according to the equipment type. After an image of the equipment to be detected is acquired, detection proceeds according to the equipment type. Specifically, a deep-learning target detection algorithm is adopted to realize automatic identification of the equipment state. In this embodiment, the deep-learning model is deployed in an embedded AI analysis module, realizing front-end deployment and improving the real-time performance of inspection-video analysis.
Example two
The embodiment provides an image semantic segmentation system based on a dual-channel and self-attention mechanism, which comprises:
an image acquisition module configured to: acquire a picture to be segmented;
a dual-channel feature map extraction module configured to: extract feature maps of two channels from the picture to be segmented, the first channel extracting a multi-scale context information feature map and the second channel extracting a pixel-level feature map;
a feature fusion module configured to: obtain, from the multi-scale context information feature map and the pixel-level feature map through matrix operation and self-attention learning, a feature map relating each pixel to its corresponding context region;
a semantic segmentation module configured to: input the feature map relating each pixel to its corresponding context region into the trained classifier, and output the semantic segmentation result of the picture.
EXAMPLE III
The embodiment of the specification provides a computer device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of the image semantic segmentation method based on the dual-channel and self-attention mechanism in the first embodiment.
Example four
The implementation manner of the present specification provides a computer readable storage medium, on which a computer program is stored, wherein the program is executed by a processor to implement the steps of the image semantic segmentation method based on the dual-channel and self-attention mechanism in the first embodiment.
Those skilled in the art will appreciate that the modules or steps of the present invention described above can be implemented by general-purpose computing means; alternatively, they can be implemented with program code executable by computing means, so that they may be stored in storage means and executed by the computing means, or fabricated separately as individual integrated circuit modules, or multiple modules or steps among them fabricated as a single integrated circuit module. The present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, they do not limit the scope of protection of the present invention; it should be understood by those skilled in the art that various modifications and variations made without inventive effort on the basis of the technical solution of the present invention remain within its scope.

Claims (10)

1. An image semantic segmentation method based on a dual-channel and self-attention mechanism, characterized by comprising the following steps:
acquiring a picture to be segmented;
extracting feature maps of two channels from the picture to be segmented: the first channel extracts a multi-scale context information feature map, and the second channel extracts a pixel-level feature map;
obtaining, from the multi-scale context information feature map and the pixel-level feature map through matrix operation and self-attention learning, a feature map relating each pixel to its corresponding context region;
and inputting the feature map relating each pixel to its corresponding context region into the trained classifier, and outputting the semantic segmentation result of the picture.
2. The image semantic segmentation method based on the dual-channel and self-attention mechanism according to claim 1, characterized in that extracting the multi-scale context information feature map in the first channel specifically comprises: inputting the picture to be segmented into an HRNetV2_w18 network to obtain a first feature map of the first channel, convolving the first feature map of the first channel to obtain a second feature map of the first channel, and extracting multi-scale context information from the second feature map of the first channel through a pyramid pooling module to obtain a third feature map of the first channel.
3. The image semantic segmentation method based on the dual-channel and self-attention mechanism according to claim 2, characterized in that after the third feature map of the first channel is obtained, random inactivation (dropout) of neurons is applied, and after convolution a cross-entropy loss function is added for auxiliary training of the network, yielding a fourth feature map of the first channel, namely the multi-scale context information feature map.
4. The image semantic segmentation method based on the dual-channel and self-attention mechanism according to claim 1, characterized in that extracting the pixel-level feature map in the second channel specifically comprises: inputting the picture to be segmented into an HRNetV2_w18 network to obtain a first feature map of the second channel, and extracting a second feature map of the second channel, namely the pixel-level feature map, from the first feature map of the second channel through convolution, batch normalization (BatchNorm) and a ReLU function.
5. The image semantic segmentation method based on the dual-channel and self-attention mechanism according to claim 1, characterized in that the matrix operation between the multi-scale context information feature map and the pixel-level feature map specifically comprises matrix-multiplying the multi-scale region context information feature map with the pixels at each position of the pixel-level feature map, including the following steps:
preprocessing the multi-scale region context information feature map and outputting a first matrix;
normalizing the pixel-level feature map and outputting a second matrix; multiplying the first matrix by the second matrix and outputting a third matrix feature map, namely the fused pixel-level region context feature map.
6. The image semantic segmentation method based on the dual-channel and self-attention mechanism according to claim 1, characterized in that the self-attention learning comprises learning the relationship between the pixel at each position and that pixel's context region, and generating a corresponding feature map according to this relationship;
specifically: the fused pixel-level region context feature map and the pixel-level feature map are respectively input into the attention mechanism to obtain three matrices Q, K and V;
the Q matrix, generated from the pixel-level feature map, is used to query relationships against the K matrix; the K matrix, generated from the fused pixel-level region context feature map, provides the keys that the Q matrix queries; the V matrix, generated from the fused pixel-level region context feature map, carries the actual information and feature attributes.
7. The image semantic segmentation method based on the dual-channel and self-attention mechanism according to claim 1, characterized in that obtaining the feature map relating each pixel to its corresponding context region specifically comprises:
the pixel-level feature map queries its relationship to the fused pixel-level region context feature map through the Q matrix, the fused pixel-level region context feature map answers the query through the K matrix, and the query result is converted into a probability weight matrix of the correspondence between the pixel-level feature map and the fused pixel-level region context feature map; the V matrix is reconstructed according to the probability weight matrix, assigning probability weight parameters to pixel regions of the same category and to pixel regions of different categories.
8. An image semantic segmentation system based on a dual-channel and self-attention mechanism, characterized by comprising:
an image acquisition module configured to: acquire a picture to be segmented;
a dual-channel feature map extraction module configured to: extract feature maps of two channels from the picture to be segmented, the first channel extracting a multi-scale context information feature map and the second channel extracting a pixel-level feature map;
a feature fusion module configured to: obtain, from the multi-scale context information feature map and the pixel-level feature map through matrix operation and self-attention learning, a feature map relating each pixel to its corresponding context region;
a semantic segmentation module configured to: input the feature map relating each pixel to its corresponding context region into the trained classifier, and output the semantic segmentation result of the picture.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the image semantic segmentation method based on the dual-channel and self-attention mechanism according to any one of claims 1 to 7 when executing the program.
10. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, carries out the steps of the image semantic segmentation method based on the dual-channel and self-attention mechanism according to any one of claims 1 to 7.
CN202111129122.7A 2021-09-26 2021-09-26 Image semantic segmentation method and system based on dual-channel and self-attention mechanism Pending CN113902753A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111129122.7A CN113902753A (en) 2021-09-26 2021-09-26 Image semantic segmentation method and system based on dual-channel and self-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111129122.7A CN113902753A (en) 2021-09-26 2021-09-26 Image semantic segmentation method and system based on dual-channel and self-attention mechanism

Publications (1)

Publication Number Publication Date
CN113902753A true CN113902753A (en) 2022-01-07

Family

ID=79029353

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111129122.7A Pending CN113902753A (en) 2021-09-26 2021-09-26 Image semantic segmentation method and system based on dual-channel and self-attention mechanism

Country Status (1)

Country Link
CN (1) CN113902753A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115019192A (en) * 2022-05-30 2022-09-06 杭州电子科技大学 Flood change detection method and system based on dual-channel backbone network and joint loss function
CN115690592A (en) * 2023-01-05 2023-02-03 阿里巴巴(中国)有限公司 Image processing method and model training method
CN115690592B (en) * 2023-01-05 2023-04-25 阿里巴巴(中国)有限公司 Image processing method and model training method

Similar Documents

Publication Publication Date Title
CN110163187B (en) F-RCNN-based remote traffic sign detection and identification method
CN111310773B (en) Efficient license plate positioning method of convolutional neural network
CN114202696A (en) SAR target detection method and device based on context vision and storage medium
CN112132844A (en) Recursive non-local self-attention image segmentation method based on lightweight
CN110222718B (en) Image processing method and device
CN113902753A (en) Image semantic segmentation method and system based on dual-channel and self-attention mechanism
CN112800906A (en) Improved YOLOv 3-based cross-domain target detection method for automatic driving automobile
CN114565045A (en) Remote sensing target detection knowledge distillation method based on feature separation attention
CN114202743A (en) Improved fast-RCNN-based small target detection method in automatic driving scene
CN115359372A (en) Unmanned aerial vehicle video moving object detection method based on optical flow network
CN111008979A (en) Robust night image semantic segmentation method
CN113066089A (en) Real-time image semantic segmentation network based on attention guide mechanism
CN113792641A (en) High-resolution lightweight human body posture estimation method combined with multispectral attention mechanism
CN114764856A (en) Image semantic segmentation method and image semantic segmentation device
CN112149526A (en) Lane line detection method and system based on long-distance information fusion
CN116740516A (en) Target detection method and system based on multi-scale fusion feature extraction
CN114492634A (en) Fine-grained equipment image classification and identification method and system
CN112668662B (en) Outdoor mountain forest environment target detection method based on improved YOLOv3 network
CN116612288B (en) Multi-scale lightweight real-time semantic segmentation method and system
CN117132910A (en) Vehicle detection method and device for unmanned aerial vehicle and storage medium
CN114219757B (en) Intelligent damage assessment method for vehicle based on improved Mask R-CNN
CN115861595A (en) Multi-scale domain self-adaptive heterogeneous image matching method based on deep learning
CN114359907A (en) Semantic segmentation method, vehicle control method, electronic device, and storage medium
CN112101330B (en) Image processing method, image processing apparatus, electronic device, and storage medium
CN117542045B (en) Food identification method and system based on space-guided self-attention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination