CN113902753A - Image semantic segmentation method and system based on dual-channel and self-attention mechanism - Google Patents

Image semantic segmentation method and system based on dual-channel and self-attention mechanism

Info

Publication number
CN113902753A
CN113902753A
Authority
CN
China
Prior art keywords
feature map
pixel
channel
matrix
context
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111129122.7A
Other languages
Chinese (zh)
Inventor
李天平
魏艳军
严业金
丁同贺
欧佳瑜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Normal University
Original Assignee
Shandong Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Normal University filed Critical Shandong Normal University
Priority to CN202111129122.7A
Publication of CN113902753A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/11 Region-based segmentation
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10024 Color image
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/20212 Image combination
    • G06T 2207/20221 Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image semantic segmentation method and system based on dual channels and a self-attention mechanism, wherein the method comprises the following steps: acquiring a picture to be segmented; extracting feature maps of two channels from the picture to be segmented, the first channel extracting a multi-scale context information feature map and the second channel extracting a pixel-level feature map; obtaining, from the multi-scale context information feature map and the pixel-level feature map through matrix operation and self-attention learning, a feature map relating each pixel to its corresponding context region; and inputting the feature map relating each pixel to its corresponding context region into the trained classifier, and outputting the semantic segmentation result of the picture.

Description

Image semantic segmentation method and system based on dual-channel and self-attention mechanism
Technical Field
The invention belongs to the technical field of computer vision and image processing, and particularly relates to an image semantic segmentation method and system based on a dual-channel and self-attention mechanism.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Semantic segmentation not only falls within the research domain of computer science, where the semantic segmentation of images has important application significance for image restoration, but also touches frontier, multi-disciplinary fields such as medical imaging, automatic driving and satellite remote sensing; it therefore has very important research significance and application value.
At present, semantic segmentation generally adopts traditional convolutional neural networks such as VGG and ResNet. However, the structures of these traditional networks are complex and their computational cost is large, leaving a gap between the real-time performance of semantic segmentation and its practical deployment.
A traditional network continuously extracts features from the input picture; through multiple layers of convolution and down-sampling, the resolution of the feature map is greatly reduced and original information is lost. More and more network designs therefore try to reduce this loss of resolution, for example by using an ASPP module composed of dilated (atrous) convolutions to extract semantic information over multi-scale receptive fields, with the encoder integrating dilated convolution or replacing pooling with stride-2 convolution to ease the contradiction between stride, receptive field and feature-map information. Although this improves results, the computational cost is large, dilated convolution cannot produce dense context information, and overusing it produces a gridding effect, so the problem of semantic segmentation is not fundamentally solved. Since semantic segmentation is a pixel-level segmentation task, the relationships among convolution, stride, resolution, receptive field and context must be considered jointly. Clearly, using the ASPP module alone produces the gridding effect and loses the overall context information, and is not a good solution for a dense segmentation task.
PSPNet adds a pyramid pooling module behind the extracted features, which solves the problem of global information loss and outperforms DeepLab, but it lacks the relationship between pixels and context. The two methods share a common defect: because they use traditional networks, the parameter count is too large, and extracting features deep in the network inevitably loses resolution. Introducing a network that keeps feature extraction at high resolution with a small computational cost thus becomes necessary for improving both the precision and the speed of semantic segmentation.
In addition, semantic segmentation is a pixel-level segmentation task, yet conventional networks consider either only the relationships between pixels, as in the DeepLab series, or only the information between regions, as in the pyramid pooling module proposed by PSPNet.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides an image semantic segmentation method and system based on a dual-channel and self-attention mechanism. The balance between complexity and precision is considered comprehensively: an HRNetV2 network is used as the backbone to extract features of the input picture, and a self-attention mechanism combined with the pyramid pooling module of PSPNet learns the relationships between pixels and regions on the feature maps extracted by the HRNetV2 network, further improving precision.
In order to achieve the above object, one or more embodiments of the present invention provide the following technical solutions:
the image semantic segmentation method based on the dual-channel and self-attention mechanism comprises the following steps:
acquiring a picture to be segmented;
extracting feature maps of two channels from the picture to be segmented: the first channel extracts a multi-scale context information feature map, and the second channel extracts a pixel-level feature map;
obtaining, from the multi-scale context information feature map and the pixel-level feature map through matrix operation and self-attention learning, a feature map relating each pixel to its corresponding context region;
and inputting the feature map relating each pixel to its corresponding context region into the trained classifier, and outputting the semantic segmentation result of the picture.
Further, extracting the multi-scale context information feature map in the first channel specifically includes: inputting the picture to be segmented into an HRNetV2_w18 network to obtain a first feature map of the first channel, convolving the first feature map of the first channel to obtain a second feature map of the first channel, and extracting multi-scale context information from the second feature map of the first channel through a pyramid pooling module to obtain a third feature map of the first channel.
Further, after the third feature map of the first channel is obtained, random inactivation (dropout) of neurons is applied, and after convolution a cross-entropy loss function is added for auxiliary training of the network, yielding a fourth feature map of the first channel, namely the multi-scale context information feature map.
Further, extracting the pixel-level feature map in the second channel specifically includes: inputting the picture to be segmented into an HRNetV2_w18 network to obtain a first feature map of the second channel, and extracting a second feature map of the second channel, namely the pixel-level feature map, from the first feature map of the second channel through convolution, batch normalization (BatchNorm) and a ReLU function.
Further, the matrix operation between the multi-scale context information feature map and the pixel-level feature map specifically includes matrix-multiplying the multi-scale region context information feature map with the pixels at each position of the pixel-level feature map, comprising the following steps:
preprocessing the multi-scale region context information feature map and outputting a first matrix;
normalizing the pixel-level feature map and outputting a second matrix; multiplying the first matrix by the second matrix and outputting a third matrix feature map, namely the fused pixel-level region context feature map.
Further, the self-attention learning comprises learning the relationship between the pixel at each position and that pixel's context region, and generating a corresponding feature map according to this relationship;
specifically: the fused pixel-level region context feature map and the pixel-level feature map are respectively input into the attention mechanism to obtain three matrices Q, K and V;
the Q matrix, generated from the pixel-level feature map, is used to query relationships against the K matrix; the K matrix, generated from the fused pixel-level region context feature map, provides the keys that the Q matrix queries; the V matrix, generated from the fused pixel-level region context feature map, carries the actual information and feature attributes.
Further, obtaining the feature map relating each pixel to its corresponding context region specifically includes:
the pixel-level feature map queries its relationship to the fused pixel-level region context feature map through the Q matrix, the fused pixel-level region context feature map answers the query through the K matrix, and the query result is converted into a probability weight matrix of the correspondence between the pixel-level feature map and the fused pixel-level region context feature map; the V matrix is reconstructed according to the probability weight matrix, assigning probability weight parameters to pixel regions of the same category and to pixel regions of different categories.
One or more embodiments provide an image semantic segmentation system based on dual channels and a self-attention mechanism, comprising:
an image acquisition module configured to: acquire a picture to be segmented;
a dual-channel feature map extraction module configured to: extract feature maps of two channels from the picture to be segmented, the first channel extracting a multi-scale context information feature map and the second channel extracting a pixel-level feature map;
a feature fusion module configured to: obtain, from the multi-scale context information feature map and the pixel-level feature map through matrix operation and self-attention learning, a feature map relating each pixel to its corresponding context region;
a semantic segmentation module configured to: input the feature map relating each pixel to its corresponding context region into the trained classifier, and output the semantic segmentation result of the picture.
One or more embodiments provide a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing, when executing the program, the steps of any of the image semantic segmentation methods based on the dual-channel and self-attention mechanism described above.
One or more embodiments provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of any of the image semantic segmentation methods based on the dual-channel and self-attention mechanism described above.
The above one or more technical solutions have the following beneficial effects:
(1) The method comprehensively considers the balance between network complexity and precision: the HRNetV2 network is used as the backbone to extract features of the input picture, and the self-attention mechanism combined with the pyramid pooling module of PSPNet learns pixel-to-region relationships on the features extracted by the HRNetV2 network, thereby improving precision.
(2) The semantic segmentation technology and system provided by the invention can optimize network performance during training and strengthen the relationship between pixels and context information, so that the network attends only to the regions and to the pixel-to-region relationships rather than segmenting all regions of the whole feature map together; this improves segmentation precision without causing erroneous segmentation. HRNetV2 is used as the backbone instead of ResNet mainly because it is a lightweight feature extraction structure with low computational requirements, excellent performance and fast model training. Overall, the network system model has a simple structure and a small computational cost, while precision is greatly improved.
(3) To prevent network overfitting, random inactivation (dropout) is added and a CrossEntropyLoss function is added for auxiliary training of the network; the resulting feature map carries context semantic information. Feature maps are obtained through multi-scale average pooling and fused, and the fused feature map carries hierarchical and diverse global context semantic information. The auxiliary training accelerates the convergence of the system model and extracts richer regions of context semantic information.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and together with the description serve to explain the invention, without limiting it.
FIG. 1 is a flowchart of an image semantic segmentation method based on dual channels and a self-attention mechanism according to an embodiment of the present invention;
FIG. 2 is a structural diagram of an image semantic segmentation method based on dual channels and a self-attention mechanism in an embodiment of the present invention;
FIG. 3 is an overall framework diagram of a pyramid scene analysis network in an embodiment of the present invention;
FIG. 4 is a diagram of a self-attention mechanism in an embodiment of the present invention;
FIGS. 5(a)-5(b) are graphs of the loss value and the learning-rate decay over the last 24 hours of the training process in the embodiment of the present invention;
FIGS. 6(a)-6(c) are original pictures of segmentation scenes in the embodiment of the present invention;
FIGS. 7(a)-7(c) are segmentation map labels for different scenes in the embodiment of the present invention;
FIGS. 8(a)-8(c) are graphs of the segmentation effect under different scenes in the embodiment of the present invention.
Detailed Description
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise; and it should be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Example one
As shown in fig. 1, the present embodiment discloses an image semantic segmentation method based on dual channels and a self-attention mechanism, which includes the following steps:
s101, acquiring a picture to be segmented;
s102, respectively extracting feature maps of two channels from the picture to be segmented;
extracting a multi-scale context information feature map by a first channel; extracting a pixel-level feature map in a second channel;
and S103, obtaining a robustness characteristic diagram of each pixel in relation to the corresponding context region by matrix operation and self-attention mechanism learning of the multi-scale region context information characteristic diagram and the pixel level characteristic diagram.
And S104, inputting the feature map of each pixel and the context region corresponding to the pixel into the trained classifier, and outputting the semantic segmentation result of the image.
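These four steps can be pictured with the following high-level PyTorch sketch; the class name TwoChannelSegmenter and the sub-module names are illustrative assumptions, not names used by the patented implementation:

    import torch.nn as nn

    class TwoChannelSegmenter(nn.Module):
        def __init__(self, backbone, context_branch, pixel_branch, attention, classifier):
            super().__init__()
            self.backbone = backbone              # shared HRNetV2 feature extractor
            self.context_branch = context_branch  # first channel: multi-scale context
            self.pixel_branch = pixel_branch      # second channel: pixel-level map
            self.attention = attention            # matrix operation + self-attention (S103)
            self.classifier = classifier          # trained classifier head (S104)

        def forward(self, image):                 # S101: picture to be segmented
            feats = self.backbone(image)          # S102: shared features
            ctx = self.context_branch(feats)      # multi-scale context information feature map
            pix = self.pixel_branch(feats)        # pixel-level feature map
            rel = self.attention(pix, ctx)        # pixel-to-context-region relations
            return self.classifier(rel)           # semantic segmentation result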
As shown in fig. 2-3, as one or more embodiments, in S102, extracting the multi-scale context information feature map in the first channel specifically includes:
inputting the picture to be segmented into the HRNetV2_w18 network to obtain a first feature map of the first channel, F_1^1; convolving F_1^1 with a 1×1 conv to obtain a second feature map of the first channel, F_1^2; and performing feature extraction on F_1^2 through a pyramid pooling module (PPM) to obtain a third feature map of the first channel, F_1^3, which carries the multi-scale context information. Throughout, a feature map is denoted F_x^y, where x indicates which channel it belongs to and y indicates which feature map of that channel it is.
In order to prevent overfitting of the network, random inactivation (dropout) is applied to the third feature map of the first channel, randomly inactivating neurons with probability 0.1, and after a 1×1 convolution a cross-entropy loss function (CrossEntropyLoss) is added for auxiliary training of the network, yielding the final fourth feature map of the first channel, F_1^4, namely the multi-scale context information feature map.
The CrossEntropyLoss function is calculated as follows:

loss(input, class) = -log( exp(input[class]) / Σ_j exp(input[j]) ) = -input[class] + log( Σ_j exp(input[j]) )    (1)

wherein class denotes the label of a given category, input denotes the feature vector of the input feature map, and input[j] denotes the score of the j-th category.
The advantage of this scheme is that the CrossEntropyLoss function added after random inactivation (dropout) accelerates the convergence of the system model and extracts richer regions.
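For illustration, a minimal PyTorch sketch of this auxiliary branch follows; the 512-channel input width is an assumption, and the 19-class output matches the Cityscapes example given later:

    import torch.nn as nn

    aux_head = nn.Sequential(
        nn.Dropout2d(p=0.1),                # random inactivation of neurons, probability 0.1
        nn.Conv2d(512, 19, kernel_size=1),  # 1x1 convolution to per-class scores
    )
    aux_criterion = nn.CrossEntropyLoss()   # the loss of formula (1), used for auxiliary training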
Relative to ResNet, HRNetV2_w18 is a lightweight backbone; it maintains the resolution of the original image during feature extraction, so the feature-extraction process is one of maintaining high resolution.
Traditional feature extraction sacrifices resolution to extract features, the networks are deep, and the demand on computing power is high. Considering both computing power and accuracy, this embodiment selects HRNetV2_w18 as the backbone.
As one or more embodiments, the pyramid pooling module (PPM) performs pooling at different output sizes to extract features from the feature map. The pooling is denoted AdaptiveAvgPool2D(I_x), where AdaptiveAvgPool2D is a pooling function and I_x specifies the height and width of the pooled output feature map: I_x = [1,1], [2,2], [3,3] and [6,6], with 67 channels.
Rich context regions are thereby extracted through pooling at different scales.
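A minimal PyTorch sketch of such a pyramid pooling module is given below; only the bin sizes [1,1], [2,2], [3,3], [6,6] and the 67-channel width are taken from the description, while the per-branch 1x1 convolutions and the bilinear upsampling are assumptions in the spirit of PSPNet:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PyramidPooling(nn.Module):
        def __init__(self, in_channels=67, bin_sizes=(1, 2, 3, 6)):
            super().__init__()
            # one AdaptiveAvgPool2d branch per output size I_x
            self.pools = nn.ModuleList(nn.AdaptiveAvgPool2d(s) for s in bin_sizes)
            self.convs = nn.ModuleList(
                nn.Conv2d(in_channels, in_channels, kernel_size=1) for _ in bin_sizes
            )

        def forward(self, x):
            h, w = x.shape[2:]
            feats = [x]
            for pool, conv in zip(self.pools, self.convs):
                y = conv(pool(x))  # pool to I_x, then 1x1 conv
                feats.append(F.interpolate(y, size=(h, w), mode='bilinear',
                                           align_corners=False))
            return torch.cat(feats, dim=1)  # concat original map with all scales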
In S102, extracting the pixel-level feature map in the second channel specifically includes:
inputting the picture to be segmented into the HRNetV2_w18 network to obtain a first feature map of the second channel, F_2^1, and extracting, after a 1×1 conv convolution, batch normalization (BatchNorm) and a ReLU function, a second feature map of the second channel, F_2^2, i.e., the pixel-level feature map.
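For illustration, the pixel head of the second channel might look as follows in PyTorch; the 270-channel input (the summed HRNetV2_w18 branch widths 18+36+72+144) and the 256-channel output are assumptions:

    import torch.nn as nn

    pixel_head = nn.Sequential(
        nn.Conv2d(270, 256, kernel_size=1),  # 1x1 conv
        nn.BatchNorm2d(256),                 # batch normalization (BatchNorm)
        nn.ReLU(inplace=True),               # ReLU function
    )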
As one or more embodiments, the matrix operation between the multi-scale region context information feature map and the pixel-level feature map specifically includes:
matrix-multiplying the multi-scale region context information feature map with the pixels at each position of the pixel-level feature map; preprocessing the multi-scale region context information feature map and outputting a first matrix; normalizing the pixel-level feature map and outputting a second matrix; and multiplying the first matrix by the second matrix and outputting a third matrix feature map, namely the fused pixel-level region context feature map.
The specific implementation is as follows:
the fourth feature map of the first channel, F_1^4, passes through reshape_1 to output [n, c, h×w], and then through transpose to output [n, h×w, c], named U, where n denotes the number of pictures input at one time, c the number of channels of the feature map, h the height of the feature map and w its width.
The second feature map of the second channel, F_2^2, passes through reshape_1 and Softmax normalization to output [n, k, h×w], named I.
The two matrices I and U are multiplied, and the product is finally output through the transpose and unsqueeze functions as [n, c, k, 1]; the resulting feature map is named F^f, the fused pixel-level region context feature map. That is, F^f is obtained by multiplying the feature map F_1^4 after multi-region feature extraction with the pixels at each position of the feature map F_2^2, so as to strengthen the relationship between pixels and region features; in the feature map obtained from this relationship, the distinction between regions is more obvious and the context information is prominent.
In this way, the category information among regions is more refined, and the pixels within each region of the feature map are enhanced and aggregated, yielding a richer feature map.
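A minimal PyTorch sketch of this fusion step, following the shape conventions just described (the function name fuse is ours):

    import torch
    import torch.nn.functional as F

    def fuse(context_map, pixel_map):
        # context_map: F_1^4, shape [n, c, h, w]; pixel_map: F_2^2, shape [n, k, h, w]
        n, c, h, w = context_map.shape
        k = pixel_map.shape[1]
        U = context_map.reshape(n, c, h * w).transpose(1, 2)   # reshape_1 + transpose: [n, h*w, c]
        I = F.softmax(pixel_map.reshape(n, k, h * w), dim=-1)  # reshape_1 + Softmax: [n, k, h*w]
        fused = torch.bmm(I, U)                                # multiply I and U: [n, k, c]
        return fused.transpose(1, 2).unsqueeze(-1)             # transpose + unsqueeze: [n, c, k, 1]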
As one or more embodiments, as shown in fig. 4, the process of self-attention learning includes: inputting the result of the matrix operation between the multi-scale context information feature map and the pixel-level feature map into a self-attention mechanism, and learning the relationship between the pixel at each position and its region.
The self-attention mechanism learns the relationship between the pixel at each position and that pixel's context region, and generates a corresponding feature map according to this relationship, specifically:
F^f is input into the self-attention mechanism, a feature map is obtained using a 1×1 conv with 256 convolution kernels, and this feature map is then flattened along the width and height dimensions to obtain the Key matrix. Similarly, the pixel-level feature map is input into the attention mechanism and passed through a 1×1 conv with 256 convolution kernels, and the resulting feature map is flattened along the height and width dimensions to obtain the Query matrix. Meanwhile, F^f is passed through a 1×1 conv with 256 kernels and flattened along the width and height dimensions to obtain the Value matrix. The Query, Key and Value matrices are abbreviated Q, K and V. The Q matrix, generated from the pixel-level feature map, is used to query relationships against the K matrix; the K matrix, generated from the fused pixel-level region context feature map, provides the keys that the Q matrix queries; the V matrix, generated from the fused pixel-level region context feature map, carries the actual information and feature attributes.
This embodiment seeks, for each pixel of F_2^2, the relation matrix between that pixel and the corresponding region of F^f, which comprises the following steps:
the pixel-level feature map queries its relationship to the fused pixel-level region context feature map through the Q matrix, the fused pixel-level region context feature map answers the query through the K matrix, and the query result is converted into a probability weight matrix of the correspondence between the pixel-level feature map and the fused pixel-level region context feature map; the V matrix is then reconstructed according to the probability weight matrix, assigning probability weight parameters to pixel regions of the same category and to pixel regions of different categories.
For example:
Step 1: F_2^2 takes its own query matrix Q to query F^f, with F^f providing its own K matrix; the query is realized by QK^T. For pixels belonging to the same class of labels, the angle between their feature vectors is close to 0 (cosine close to 1), so the values of their inner products are very large, whereas the inner products between pixels belonging to different classes of labels are very small.
Step 2: QK^T is passed through the normalized exponential function softmax, transforming the relationship between F_2^2 and F^f into the corresponding relational probability weight matrix SIM.
Step 3: the V matrix is reconstructed, giving pixel regions of the same category higher probability weight parameters and pixel regions of different categories low probability weight parameters; this is realized by the formula softmax(QK^T)V, and the reconstructed feature map is named the self-attention feature map.
At this point, the reconstructed feature map no longer carries the features and information of the original feature maps themselves; instead, it carries the features and information of the relationship between each pixel of F_2^2 and the corresponding context region of F^f.
The specific formulas are implemented as follows:

Q = tran(reshape_1(c(F_2^2)))    (2)
K = reshape_1(c(F^f))    (3)
V = tran(reshape_1(c(F^f)))    (4)
SIM = Softmax(BMM(Q, K))    (5)
F^s = reshape_2(tran(BMM(V, SIM)))    (6)

wherein s_x is the set of pixel points of a single-channel feature map belonging to different feature maps, d_x denotes the different channels generated as the feature map is propagated, c denotes the 1×1 convolution, and reshape_1 denotes a transfer function that converts the three-dimensional feature map into a matrix of vectors.
As a specific example: the data set of this embodiment has 19 categories, so the corresponding number of vectors is 19. For the variable Q, the input F_2^2 is output as [n, 19, w1, h1] through a 1×1 conv, then as [n, 19, w1×h1] through reshape_1, and finally as [n, w1×h1, 19] through tran.
For the variable K, the input F^f is output as [n, 19, h2, w2] through a 1×1 conv, then as [n, 19, h2×w2] through reshape_1.
For V, the same operation as for Q is applied: the input F^f is finally output as [n, w2×h2, 19]. BMM denotes the multiplication of two matrices and Softmax denotes the normalized exponential function.
As for reshape_2 in formula (6), its input is tran(BMM(V, SIM)); with dim being 0, it converts the pixel set of dimension h1×w1 of each channel into a single-channel two-dimensional planar feature map whose dimensions are the same as the label's, so that the loss can be computed against the label.
Through these five groups of formulas, the relationship between each pixel and its context is obtained, and the relationship between each region and the corresponding pixels can likewise be calculated.
Here, F^s is a robust feature map carrying the relationship between pixels and context.
The advantage of the above scheme is that the robust feature map with pixel-context relationships finally obtained in this embodiment considers not only the relationships between regions and between a pixel and its surroundings, but also the relationships between pixels and regions, so that similar pixels are aggregated.
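For illustration, formulas (2)-(6) can be sketched in PyTorch as follows; the input channel width is an assumption, and the operand order of the final BMM is chosen so that the shapes compose, realizing the softmax(QK^T)V of Step 3:

    import torch
    import torch.nn as nn

    class PixelRegionAttention(nn.Module):
        def __init__(self, channels=256, num_classes=19):
            super().__init__()
            self.q_conv = nn.Conv2d(channels, num_classes, kernel_size=1)  # c(.) in (2)
            self.k_conv = nn.Conv2d(channels, num_classes, kernel_size=1)  # c(.) in (3)
            self.v_conv = nn.Conv2d(channels, num_classes, kernel_size=1)  # c(.) in (4)

        def forward(self, pixel_map, fused_map):
            # pixel_map: F_2^2, [n, channels, h1, w1]; fused_map: F^f
            n, _, h1, w1 = pixel_map.shape
            Q = self.q_conv(pixel_map).flatten(2).transpose(1, 2)  # (2): [n, w1*h1, 19]
            K = self.k_conv(fused_map).flatten(2)                  # (3): [n, 19, h2*w2]
            V = self.v_conv(fused_map).flatten(2).transpose(1, 2)  # (4): [n, h2*w2, 19]
            SIM = torch.softmax(torch.bmm(Q, K), dim=-1)           # (5): probability weights
            out = torch.bmm(SIM, V)                                # (6): reconstruct V, [n, w1*h1, 19]
            return out.transpose(1, 2).reshape(n, -1, h1, w1)      # reshape_2: back to a spatial map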
Finally, CONT is concatenated (concat) with the pixel-level feature map F_2^2 and input into a 1×1 convolution, and a feature map of the same size as the input picture is then output; this output is used to compute the loss against the real label, the loss function being the same as formula (1).
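A sketch of this final head, under assumed names and channel counts (cont stands for CONT and pixel_map for F_2^2):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FusionHead(nn.Module):
        def __init__(self, in_channels, num_classes=19):
            super().__init__()
            self.conv = nn.Conv2d(in_channels, num_classes, kernel_size=1)

        def forward(self, cont, pixel_map, out_size):
            x = torch.cat([cont, pixel_map], dim=1)  # concat along the channel dimension
            x = self.conv(x)                         # 1x1 convolution
            # resize to the input-picture size so the loss is taken per pixel
            return F.interpolate(x, size=out_size, mode='bilinear', align_corners=False)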
Preferably, this embodiment optimizes the system model with a primary loss function and an auxiliary loss function, the ratio of primary to auxiliary being 1 : 0.4.
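The stated 1 : 0.4 weighting can be sketched as:

    import torch.nn as nn

    criterion = nn.CrossEntropyLoss()  # same form as formula (1)

    def total_loss(main_logits, aux_logits, labels, aux_weight=0.4):
        # primary : auxiliary = 1 : 0.4, as stated above
        return criterion(main_logits, labels) + aux_weight * criterion(aux_logits, labels)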
To reflect the fairness of the experiment, comparison experiments between the PSPNet model and the present system model were carried out repeatedly on the Cityscapes data set; the selected device is a GPU Tesla V100 32GB. The training method inputs a batch size of 10 pictures, with 100 iteration steps at a time, to ensure that the loss value changes stably, as shown in figs. 5(a)-5(b).
The total number of iteration steps is set to 160000. The optimizer is SGD with learning rate 0.0025 and momentum 0.9, and two CrossEntropyLoss terms are used; the loss function is as in formula (1). The two loss functions assist the training of the system model jointly. PSPNet chooses ResNet50 as its backbone, while we choose HRNetV2_w18. Compared with the original model, the mIoU of the present network system is 1.7% higher than that of PSPNet, and the segmentation accuracy of targets is clearly improved; the larger the mIoU, the more obvious the advantage of the system and the more accurate the localization of the target segmentation bounding box. The mPA of this embodiment is 0.5% higher than that of PSPNet.
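The stated optimizer configuration corresponds to the following sketch, where model is a stand-in for the full network described above:

    import torch
    import torch.nn as nn

    model = nn.Conv2d(3, 19, kernel_size=1)  # placeholder for the dual-channel network
    optimizer = torch.optim.SGD(model.parameters(), lr=0.0025, momentum=0.9)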
This embodiment runs experiments on the Cityscapes benchmark data set; after comparison with the PSPNet algorithm, the obtained mIoU and IoU values are shown in the tables below. As can be seen from tables 1 and 2, the network performance is better.
TABLE 1 (mIoU comparison with PSPNet on Cityscapes; the table is an image in the original publication)
TABLE 2 (IoU values; the table is an image in the original publication)
FIGS. 5(a)-5(b) are graphs of the loss value and the learning-rate decay over the last 24 hours of the training process in the embodiment of the present invention; as shown in figs. 5(a)-5(b), they give a visual display of the precision of this embodiment. Scene samples in the verification set are randomly selected for test verification, to visually display the practicality of the system.
FIGS. 6(a)-6(c) are original pictures of segmentation scenes in the embodiment of the present invention; figs. 7(a)-7(c) are the corresponding segmentation map labels; figs. 8(a)-8(c) are graphs of the segmentation effect under different scenes. The method of this embodiment can correctly segment different categories, such as pedestrians, vehicles, cyclists and traffic lights, and renders different categories in different colors. Because the method uses HRNetV2_w18 as the backbone, the parameter count is small and the practicality is stronger. The method can effectively segment different complex scenes without interference from the surrounding environment, showing strong anti-interference capability, and it performs well in complex and variable data environments.
Those skilled in the art will understand that the detection contents for different types of equipment differ during the inspection process, and the detection contents can be preset according to the equipment type. After an image of the equipment to be detected is acquired, detection proceeds according to the equipment type. Specifically, a deep-learning target detection algorithm is adopted to realize automatic identification of the equipment state. In this embodiment, the deep-learning model is deployed in an embedded AI analysis module, realizing front-end deployment and improving the real-time performance of inspection-video analysis.
Example two
The embodiment provides an image semantic segmentation system based on a dual-channel and self-attention mechanism, which comprises:
an image acquisition module configured to: acquire a picture to be segmented;
a dual-channel feature map extraction module configured to: extract feature maps of two channels from the picture to be segmented, the first channel extracting a multi-scale context information feature map and the second channel extracting a pixel-level feature map;
a feature fusion module configured to: obtain, from the multi-scale context information feature map and the pixel-level feature map through matrix operation and self-attention learning, a feature map relating each pixel to its corresponding context region;
a semantic segmentation module configured to: input the feature map relating each pixel to its corresponding context region into the trained classifier, and output the semantic segmentation result of the picture.
EXAMPLE III
The embodiment of the specification provides a computer device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of the image semantic segmentation method based on the dual-channel and self-attention mechanism in the first embodiment.
Example four
The implementation manner of the present specification provides a computer readable storage medium, on which a computer program is stored, wherein the program is executed by a processor to implement the steps of the image semantic segmentation method based on the dual-channel and self-attention mechanism in the first embodiment.
Those skilled in the art will appreciate that the modules or steps of the present invention described above can be implemented by general-purpose computing means; alternatively, they can be implemented with program code executable by computing means, so that they may be stored in storage means and executed by the computing means, or fabricated separately as individual integrated circuit modules, or multiple modules or steps among them fabricated as a single integrated circuit module. The present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, they do not limit the scope of protection of the present invention; it should be understood by those skilled in the art that various modifications and variations made without inventive effort on the basis of the technical solution of the present invention remain within its scope.

Claims (10)

1. An image semantic segmentation method based on a dual-channel and self-attention mechanism, characterized by comprising the following steps:
acquiring a picture to be segmented;
extracting feature maps of two channels from the picture to be segmented: the first channel extracts a multi-scale context information feature map, and the second channel extracts a pixel-level feature map;
obtaining, from the multi-scale context information feature map and the pixel-level feature map through matrix operation and self-attention learning, a feature map relating each pixel to its corresponding context region;
and inputting the feature map relating each pixel to its corresponding context region into the trained classifier, and outputting the semantic segmentation result of the picture.
2. The image semantic segmentation method based on the dual-channel and self-attention mechanism according to claim 1, characterized in that extracting the multi-scale context information feature map in the first channel specifically comprises: inputting the picture to be segmented into an HRNetV2_w18 network to obtain a first feature map of the first channel, convolving the first feature map of the first channel to obtain a second feature map of the first channel, and extracting multi-scale context information from the second feature map of the first channel through a pyramid pooling module to obtain a third feature map of the first channel.
3. The image semantic segmentation method based on the dual-channel and self-attention mechanism according to claim 2, characterized in that after the third feature map of the first channel is obtained, random inactivation (dropout) of neurons is applied, and after convolution a cross-entropy loss function is added for auxiliary training of the network, yielding a fourth feature map of the first channel, namely the multi-scale context information feature map.
4. The image semantic segmentation method based on the dual-channel and self-attention mechanism according to claim 1, characterized in that extracting the pixel-level feature map in the second channel specifically comprises: inputting the picture to be segmented into an HRNetV2_w18 network to obtain a first feature map of the second channel, and extracting a second feature map of the second channel, namely the pixel-level feature map, from the first feature map of the second channel through convolution, batch normalization (BatchNorm) and a ReLU function.
5. The image semantic segmentation method based on the dual-channel and self-attention mechanism according to claim 1, characterized in that the matrix operation between the multi-scale context information feature map and the pixel-level feature map specifically comprises matrix-multiplying the multi-scale region context information feature map with the pixels at each position of the pixel-level feature map, including the following steps:
preprocessing the multi-scale region context information feature map and outputting a first matrix;
normalizing the pixel-level feature map and outputting a second matrix; multiplying the first matrix by the second matrix and outputting a third matrix feature map, namely the fused pixel-level region context feature map.
6. The image semantic segmentation method based on the dual-channel and self-attention mechanism according to claim 1, characterized in that the self-attention learning comprises learning the relationship between the pixel at each position and that pixel's context region, and generating a corresponding feature map according to this relationship;
specifically: the fused pixel-level region context feature map and the pixel-level feature map are respectively input into the attention mechanism to obtain three matrices Q, K and V;
the Q matrix, generated from the pixel-level feature map, is used to query relationships against the K matrix; the K matrix, generated from the fused pixel-level region context feature map, provides the keys that the Q matrix queries; the V matrix, generated from the fused pixel-level region context feature map, carries the actual information and feature attributes.
7. The image semantic segmentation method based on the dual-channel and self-attention mechanism according to claim 1, characterized in that obtaining the feature map relating each pixel to its corresponding context region specifically comprises:
the pixel-level feature map queries its relationship to the fused pixel-level region context feature map through the Q matrix, the fused pixel-level region context feature map answers the query through the K matrix, and the query result is converted into a probability weight matrix of the correspondence between the pixel-level feature map and the fused pixel-level region context feature map; the V matrix is reconstructed according to the probability weight matrix, assigning probability weight parameters to pixel regions of the same category and to pixel regions of different categories.
8. An image semantic segmentation system based on a dual-channel and self-attention mechanism, characterized by comprising:
an image acquisition module configured to: acquire a picture to be segmented;
a dual-channel feature map extraction module configured to: extract feature maps of two channels from the picture to be segmented, the first channel extracting a multi-scale context information feature map and the second channel extracting a pixel-level feature map;
a feature fusion module configured to: obtain, from the multi-scale context information feature map and the pixel-level feature map through matrix operation and self-attention learning, a feature map relating each pixel to its corresponding context region;
a semantic segmentation module configured to: input the feature map relating each pixel to its corresponding context region into the trained classifier, and output the semantic segmentation result of the picture.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the image semantic segmentation method based on the dual-channel and self-attention mechanism according to any one of claims 1 to 7 when executing the program.
10. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, carries out the steps of the image semantic segmentation method based on the dual-channel and self-attention mechanism according to any one of claims 1 to 7.
CN202111129122.7A 2021-09-26 2021-09-26 Image semantic segmentation method and system based on dual-channel and self-attention mechanism Pending CN113902753A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111129122.7A CN113902753A (en) 2021-09-26 2021-09-26 Image semantic segmentation method and system based on dual-channel and self-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111129122.7A CN113902753A (en) 2021-09-26 2021-09-26 Image semantic segmentation method and system based on dual-channel and self-attention mechanism

Publications (1)

Publication Number Publication Date
CN113902753A true CN113902753A (en) 2022-01-07

Family

ID=79029353

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111129122.7A Pending CN113902753A (en) 2021-09-26 2021-09-26 Image semantic segmentation method and system based on dual-channel and self-attention mechanism

Country Status (1)

Country Link
CN (1) CN113902753A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115019192A (en) * 2022-05-30 2022-09-06 杭州电子科技大学 Flood change detection method and system based on dual-channel backbone network and joint loss function
CN115690592A (en) * 2023-01-05 2023-02-03 阿里巴巴(中国)有限公司 Image processing method and model training method
CN115690592B (en) * 2023-01-05 2023-04-25 阿里巴巴(中国)有限公司 Image processing method and model training method

Similar Documents

Publication Publication Date Title
CN110163187B (en) F-RCNN-based remote traffic sign detection and identification method
CN111310773B (en) Efficient license plate positioning method of convolutional neural network
CN114202696A (en) SAR target detection method and device based on context vision and storage medium
CN112132844A (en) Recursive non-local self-attention image segmentation method based on lightweight
CN110222718B (en) Image processing method and device
CN113902753A (en) Image semantic segmentation method and system based on dual-channel and self-attention mechanism
CN112800906A (en) Improved YOLOv 3-based cross-domain target detection method for automatic driving automobile
CN114565045A (en) Remote sensing target detection knowledge distillation method based on feature separation attention
CN114202743A (en) Improved fast-RCNN-based small target detection method in automatic driving scene
CN115359372A (en) Unmanned aerial vehicle video moving object detection method based on optical flow network
CN111008979A (en) Robust night image semantic segmentation method
CN113066089A (en) Real-time image semantic segmentation network based on attention guide mechanism
CN113792641A (en) High-resolution lightweight human body posture estimation method combined with multispectral attention mechanism
CN114764856A (en) Image semantic segmentation method and image semantic segmentation device
CN112149526A (en) Lane line detection method and system based on long-distance information fusion
CN116740516A (en) Target detection method and system based on multi-scale fusion feature extraction
CN114492634A (en) Fine-grained equipment image classification and identification method and system
CN112668662B (en) Outdoor mountain forest environment target detection method based on improved YOLOv3 network
CN116612288B (en) Multi-scale lightweight real-time semantic segmentation method and system
CN117132910A (en) Vehicle detection method and device for unmanned aerial vehicle and storage medium
CN114219757B (en) Intelligent damage assessment method for vehicle based on improved Mask R-CNN
CN115861595A (en) Multi-scale domain self-adaptive heterogeneous image matching method based on deep learning
CN114359907A (en) Semantic segmentation method, vehicle control method, electronic device, and storage medium
CN112101330B (en) Image processing method, image processing apparatus, electronic device, and storage medium
CN117542045B (en) Food identification method and system based on space-guided self-attention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination