CN112580661B - Multi-scale edge detection method under deep supervision - Google Patents

Multi-scale edge detection method under deep supervision

Info

Publication number
CN112580661B
CN112580661B (application CN202011445466.4A)
Authority
CN
China
Prior art keywords
edge
attention
global
convolution
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011445466.4A
Other languages
Chinese (zh)
Other versions
CN112580661A (en)
Inventor
孙俊
张旺
吴豪
吴小俊
方伟
陈祺东
李超
游琪
冒钟杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangnan University
Original Assignee
Jiangnan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangnan University
Priority to CN202011445466.4A
Publication of CN112580661A
Application granted
Publication of CN112580661B
Legal status: Active (current)
Anticipated expiration legal status

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
          • G06V 10/00 Arrangements for image or video recognition or understanding
            • G06V 10/40 Extraction of image or video features
              • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
        • G06F ELECTRIC DIGITAL DATA PROCESSING
          • G06F 18/00 Pattern recognition
            • G06F 18/20 Analysing
              • G06F 18/24 Classification techniques
                • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
                  • G06F 18/2415 Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
        • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00 Computing arrangements based on biological models
            • G06N 3/02 Neural networks
              • G06N 3/04 Architecture, e.g. interconnection topology
                • G06N 3/045 Combinations of networks
                • G06N 3/047 Probabilistic or stochastic networks
                • G06N 3/048 Activation functions
              • G06N 3/08 Learning methods
        • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
          • G06T 3/00 Geometric image transformations in the plane of the image
            • G06T 3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
              • G06T 3/4007 Scaling of whole images or parts thereof based on interpolation, e.g. bilinear interpolation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

A multi-scale edge detection method under deep supervision. The method combines local features with their corresponding global correlations, adaptively recalibrates channel responses, directs the network to ignore irrelevant information, and emphasizes correlations between related features. The effectiveness of the multi-scale, deeply supervised self-attention module algorithm was demonstrated by a series of ablation experiments on the BSDS500 and NYUD datasets. Compared with other state-of-the-art edge detection networks, the algorithm performs better and improves prediction accuracy with fewer parameters, achieving an ODS score of 0.815 on the BSDS500 dataset, 0.9% higher than prior algorithms.

Description

Multi-scale edge detection method under deep supervision
Technical Field
The invention belongs to the field of edge detection, and particularly relates to a multi-scale edge detection method under deep supervision.
Background
Edge detection aims at extracting object boundaries and visually salient edges from natural images, which are important for high-level computer vision tasks such as image segmentation and object detection/recognition. As a basis for these high-level tasks, edge detection has a rich history; we focus here on several representative works of proven significance. Early conventional approaches include the Sobel detector, zero-crossing detection, and the widely used Canny detector. Pb, gPb, Sketch Tokens, and Structured Edges use sophisticated learning paradigms to distinguish edge pixels based on manual features (e.g., brightness, color, gradient, and texture). However, it is difficult to represent semantic meaning using such low-level visual cues.
The edges of an image are made up of meaningful local detail and object-level boundaries. Since CNNs have a strong ability to automatically learn high-level features of natural images, they have been applied to edge detection with good results, e.g., N4-Fields, DeepContour, DeepEdge and CSCNN. To obtain diverse edge scales, the CNN-based HED and RCF supervise the predictions of different network layers with ground-truth edge maps: lower layers detect more local detail, while higher layers capture object-level boundaries with larger receptive fields. HED indicates that, at high recall, deep supervision can compromise low-level prediction while facilitating the learning of global object boundaries. Rich convolution features are very effective for many visual tasks, but HED and RCF still do not explicitly use global context information in the training and prediction strategies of the side outputs, nor do they directly impose constraints on neighboring pixel labels to strengthen deep supervision. We can therefore improve the quality of the network representation by explicitly modeling channel correlation: the network can adaptively recalibrate channel responses and learn to use global information to emphasize useful features and suppress less useful ones.
As shown in FIG. 1, as the receptive field becomes larger, the edges captured by the different convolution layers become progressively coarser and lose much useful detail. The purpose of capturing long-range correlations is to extract a global understanding of the visual scene, which has proven useful for a wide range of recognition tasks such as image/video classification, object detection and segmentation, and which RCF also requires. In CNNs, long-range correlation is mainly modeled by deeply stacking convolution layers, since a convolution layer only establishes pixel relationships in a local neighborhood. However, simply repeating convolution layers is computationally inefficient, difficult to optimize, and makes it hard to pass information between distant locations, resulting in ineffective modeling of long-range correlations. To solve this problem, we model the global context to form an attention map and then aggregate the features of all locations with weights defined by the attention map. Finally, the aggregated features and the features at each location are added to form new features.
Disclosure of Invention
The invention aims to provide a multi-scale edge detection method under deep supervision, which solves the problems existing in the prior art.
The technical scheme of the invention is as follows:
a multi-scale edge detection method under depth supervision comprises the following specific steps:
(1) The constructed edge detector comprises an improved VGG16 network and an attention module; the improved VGG16 network removes the fifth pooling layer and all full connection layers of the original VGG16, and keeps 13 convolution layers and the first four pooling layers; the attention module is composed of a global module and a channel module, wherein the global module comprises a 1 multiplied by 1 convolution layer and a softmax function layer, the channel module comprises a bottleneck structure, a normalization layer and a Relu activation layer, the bottleneck structure comprises two full connection layers, and each full connection layer is a 1 multiplied by 1 convolution layer;
(2) Initializing the improved VGG16 network using the VGG16 pre-trained on ImageNet;
(3) Expanding the images of the data sets by using rotation, flipping and scaling, adjusting the sizes of the images by 0.5, 1.0 and 1.5 times to construct image pyramids, and sequentially inputting the image pyramids of each data set into an edge detector;
(4) The improved VGG16 network carries out phase 1 to phase 4 convolution operation on the input data set image, the attention module carries out 1X 1 convolution operation on the output of the 4 th phase, the operation result is input into a softmax function to obtain a global context attention figure, and the global context attention figure is shared with each channel of the output characteristic of the 4 th phase; the dimension of the output characteristic channel of the 4 th stage fused with the global context attention is reduced by using one full connection layer in the bottleneck structure, and the dimension-reduced global context attention is normalized by using LayerNorm; inputting the normalized data of each channel into a ReLU activation function, and then increasing the channel dimension to be reduced by another full-connection layer in the bottleneck structure to obtain the characteristics which are fused with global characteristics and adjust the response among channels; inputting the obtained features which are fused with the global features and adjust the response among channels into a convolution layer of the stage 5, and carrying out the stage 5 convolution operation; then downsampling the output of each convolution layer from the stage 1 to the stage 5, and extracting multi-scale features to obtain a multi-scale feature map;
(5) The global module of the attention module carries out 1X 1 convolution operation on the multi-scale feature obtained in the step (4), inputs the operation result into a softmax function to obtain a global context attention pattern, and shares the global context attention pattern with each channel of the multi-scale feature;
(6) Using one full connection layer in the bottleneck structure to reduce the dimension of the multi-scale feature channel fused with the global context attention, and using LayerNorm to normalize the reduced dimension global context attention; inputting the normalized data of each channel into a ReLU activation function, and then increasing the channel dimension to be reduced by another full-connection layer in the bottleneck structure to obtain the characteristics which are fused with global characteristics and adjust the response among channels;
(7) Aggregating the features obtained in the step (6) to each position of the multi-scale feature map in the step (4) through addition to obtain aggregated features;
(8) Performing element addition on the aggregate characteristics obtained in the step (7) according to stages by using a convolution with a kernel size of 1 multiplied by 1 and a channel depth of 1 to obtain composite characteristics;
(9) Up-sampling the composite features in the step (8) by using deconvolution to obtain edge output of each stage, and monitoring the edge output by using loss/sigmoid to optimize the parameters of the edge detector;
(10) Fusing the edge outputs of each stage in the step (9) by using a concat function and 1 multiplied by 1 convolution to obtain an edge prediction graph;
(11) Adjusting edge prediction graphs of other sizes in the image pyramid to the original image size by using bilinear interpolation; averaging the edge prediction graphs with the adjusted sizes to obtain a final prediction graph; the edge detector parameters are constantly learned and optimized by using a loss/sigmoid supervised edge prediction graph.
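The multi-scale testing strategy of steps (3) and (11) can be sketched in PyTorch (the framework used in the embodiments) roughly as follows; the function and variable names are our own, and "model" stands for the trained edge detector of steps (1) to (10), so this is an illustrative sketch rather than the exact implementation:

```python
import torch
import torch.nn.functional as F

def multi_scale_predict(model, image, scales=(0.5, 1.0, 1.5)):
    """Run the edge detector on an image pyramid and average the results.

    model : assumed to map a (1, 3, H, W) tensor to a (1, 1, H, W) edge
            probability map at the input resolution.
    image : a (1, 3, H, W) tensor.
    """
    _, _, h, w = image.shape
    fused = image.new_zeros((1, 1, h, w))
    for s in scales:
        # Build one level of the image pyramid (step (3)).
        scaled = F.interpolate(image, scale_factor=s, mode='bilinear',
                               align_corners=False)
        with torch.no_grad():
            pred = model(scaled)                     # edge prediction map at this scale
        # Resize the prediction back to the original size with bilinear
        # interpolation (step (11)).
        pred = F.interpolate(pred, size=(h, w), mode='bilinear',
                             align_corners=False)
        fused += pred
    return fused / len(scales)                       # average over the pyramid levels
```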
The loss function of loss/sigmoid is specifically as follows:
One sample of the input training dataset T is denoted by (X, Y), where X = {x_i, i = 1, ···, |X|} is the original input image and Y = {y_i, i = 1, ···, |X|}, y_i ∈ {0, 1}, is the corresponding ground-truth edge map. The training loss of each image is given by formula (1):
where Y+ and Y− denote the ground-truth label sets of edge and non-edge pixels, respectively, λ is the parameter that automatically balances the loss between the positive and negative classes, W denotes all network-layer parameters, P(y_i = 1 | X; W) is the probability, computed from the input X with parameters W, that pixel i is an edge when its true value y_i is 1, and P(y_i = 0 | X; W) is the probability, computed from the input X with parameters W, that pixel i is a non-edge when its true value y_i is 0.
The final loss is obtained by further aggregating the edge maps formed by the edge outputs of each stage in step (9), as shown in formula (2):
where X_j denotes the edge map output by stage j, and X_fuse denotes the edge map output by the final fusion layer.
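As an illustration of the loss described above, the following PyTorch sketch implements an HED/RCF-style class-balanced cross-entropy of the kind formula (1) refers to; the exact placement of the balancing weights is our assumption based on the surrounding description, not a verbatim reproduction of the patent formula:

```python
import torch

def edge_loss(pred, label, lam=1.1):
    """Class-balanced sigmoid cross-entropy for one edge map (cf. formula (1)).

    pred  : (N, 1, H, W) tensor of edge probabilities P(y_i = 1 | X; W)
    label : (N, 1, H, W) ground-truth edge map with values in {0, 1}
    lam   : balancing parameter lambda (1.1 for BSDS500, 1.2 for NYUD)
    """
    pos = (label == 1).float()                    # Y+ : edge pixels
    neg = (label == 0).float()                    # Y- : non-edge pixels
    num_pos, num_neg = pos.sum(), neg.sum()
    alpha = lam * num_pos / (num_pos + num_neg)   # weight for the negative term
    beta = num_neg / (num_pos + num_neg)          # weight for the positive term
    eps = 1e-6
    loss = -(beta * pos * torch.log(pred + eps)
             + alpha * neg * torch.log(1.0 - pred + eps)).sum()
    return loss
```

Following formula (2), the final loss would then aggregate the per-stage edge maps and the fused edge map, e.g. sum(edge_loss(p, y) for p in side_outputs) + edge_loss(fused, y).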
The functions of the attention module are specifically as follows:
First, the 1×1 convolution W_G of the global module and a softmax function produce the global attention weights, a global context attention map S is computed by attention pooling, and the global context attention map S is shared so that the edge detector can obtain long-range global context information. Then the channel response is recalibrated by the two 1×1 convolutions W_C in the bottleneck structure. Finally, the global context features are weighted and aggregated by addition onto the features at each location.
Let U = {u_n, n = 1, ···, N} denote the multi-scale feature map input to the attention module, where N = H×W is the number of pixels in the feature map. The global context attention is given by formula (3):
S = Σ_n [ exp(W_g u_n) / Σ_m exp(W_g u_m) ] · u_n    (3)
where n lists all possible positions, exp(W_g u_n) is the embedded Gaussian function used to compute similarity in the embedded space, Σ_m exp(W_g u_m) is the normalization factor, W_g denotes the 1×1 convolution W_G, and m is a variable that also lists all possible positions.
The bottleneck structure reduces the number of parameters from C·C to 2·C·C/r, where C is the number of channels, r is the bottleneck ratio, and C/r is the hidden representation dimension of the bottleneck. A normalization layer is added to the bottleneck transform before the ReLU layer. Let Z = {z_n, n = 1, ···, N} denote the output feature map of the attention module; the complete attention module is given by formula (4):
z_n = u_n + W_C2 ReLU(LN(W_C1 S))    (4);
where W_C2 denotes the convolution operation of W_C2, and LN(W_C1 S) denotes applying the convolution W_C1 to the attention map S followed by layer normalization LN.
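A minimal PyTorch sketch of the attention module described by formulas (3) and (4) is given below; it assumes the GCNet-style simplified non-local context modeling that the text describes, and the class and variable names are our own:

```python
import torch
import torch.nn as nn

class GlobalChannelAttention(nn.Module):
    """Global context pooling plus channel recalibration (formulas (3) and (4)).

    channels : number of input channels C
    r        : bottleneck ratio (r = 16 in the ablation study)
    """
    def __init__(self, channels, r=16):
        super().__init__()
        self.w_g = nn.Conv2d(channels, 1, kernel_size=1)       # W_G: global attention weights
        self.softmax = nn.Softmax(dim=2)
        self.bottleneck = nn.Sequential(                        # W_C1 -> LN -> ReLU -> W_C2
            nn.Conv2d(channels, channels // r, kernel_size=1),  # W_C1: reduce C to C/r
            nn.LayerNorm([channels // r, 1, 1]),                # normalization before ReLU
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, kernel_size=1),  # W_C2: restore C/r to C
        )

    def forward(self, u):
        b, c, h, w = u.shape
        # Attention pooling over all N = H*W positions (formula (3)).
        weights = self.softmax(self.w_g(u).view(b, 1, h * w))          # (b, 1, N)
        s = torch.bmm(u.view(b, c, h * w), weights.transpose(1, 2))    # (b, c, 1)
        s = s.view(b, c, 1, 1)                                         # global context map S
        # Channel recalibration and additive fusion (formula (4)):
        # z_n = u_n + W_C2 ReLU(LN(W_C1 S)), broadcast over all positions.
        return u + self.bottleneck(s)
```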
The invention has the following beneficial effects: the invention introduces a deeply supervised attention structure to accomplish the edge detection task. The method combines global information from different layers with the self-attention module to effectively model long-range correlations. Finally, noise regions are filtered out by dynamically recalibrating the channel features, which helps the network focus on the relevant regions of the image. Comparisons with more than 10 edge detection methods on the BSDS500 and NYUD datasets show that the method provides accurate and reliable edge detection.
Drawings
FIG. 1 shows the side outputs of the RCF stages. From left to right: the original image from the BSDS500 dataset and the side outputs of stage 1, stage 2, stage 3, stage 4 and stage 5.
FIG. 2 is an architecture of a multi-scale feature edge detection network under deep supervision.
FIG. 3 is a global channel self-attention module.
FIG. 4 is a diagram of a global channel self-attention module architecture.
FIG. 5 is a P-R curve on BSDS500 for our method and other methods.
FIG. 6 is an edge map comparison before NMS on the BSDS500 dataset. The first row is the original image, the second row is the ground-truth edge map, the third row is the RCF prediction, and the fourth row is the result of the method of the invention.
FIG. 7 is a P-R curve on NYUD for our method and other work.
Detailed Description
The technical scheme of the invention is further described below according to the attached drawings and the embodiments.
1. Edge detection
Edge detection is one of the most basic and challenging problems in computer vision. After decades of research, a large body of work has accumulated. We review only a portion of the representative related work in this section.
These methods can be broadly divided into three categories: traditional edge operators, learning-based methods, and, more recently, deep-learning-based methods. Traditional edge operators detect edges by detecting abrupt changes in luminance, color, and texture. The Sobel operator thresholds the image gradient to obtain edges. Canny extracts edges from the Gaussian-smoothed image using a double-threshold method; the Canny algorithm is still popular in various tasks due to its efficiency and robustness to noise. However, the accuracy of these early methods hardly meets today's high demands on detail. Learning-based methods use manual features to identify edges. Martin et al. train a classifier that combines texture gradient features. Arbeláez et al. integrate local cues into a global framework. Lim et al. map local patches to Sketch Tokens using a random forest to form local edges. Dollár and Zitnick propose Structured Edges, supervised with multi-scale responses, which can learn clustering and mapping simultaneously and directly output local edge patches. However, methods based on manual features cannot efficiently express the high-level, semantically meaningful information of edges. In recent years, automatic extraction of deep features with deep learning has achieved state-of-the-art results. Shen et al. use shape information to learn deep features that fit each subclass. Bertasius et al. use a CNN to generate features of candidate contour points. Xie and Tu propose an end-to-end model that deeply supervises the side-output features of different scales, achieving excellent performance (less than 2% below human level). On this basis, Kokkinos adjusts the loss function, adds training samples, and computes it globally. Liu et al. attach side outputs to all convolution layers of VGG16 and further add more features of different scales to improve the effect; their results exceed human performance on the BSDS500 dataset. Our approach is based on RCF; the above training strategies do not explicitly use context information, nor do they directly impose constraints on neighboring pixel labels, so we use global features to enhance the context modeling of the multi-scale side outputs.
2. Deep attention
The attention mechanism aims at emphasizing important regions, filtering out irrelevant information, and improving the modeling of long-range correlations. Recently, self-attention mechanisms have been successfully applied to various visual tasks such as visual question answering, classification, and detection. The mechanism embeds the independent response of each location into a space and weight-averages it to establish the relationship between local features and their corresponding global context. PSANet adaptively links each location in the feature map with the other locations, enabling aggregation of long-range context information. SENet and GENet re-weight different channels to recalibrate channel correlation according to the global context. However, the rescaling style of feature fusion is ineffective for global context modeling. The present invention adopts additive fusion to model the global context more effectively.
3. Summary of the method
The VGG16 network consists of 13 convolution layers, 3 fully connected layers and 5 pooling layers; it is deep, dense and multi-stage, and can efficiently generate acceptable multi-scale features to capture the inherent scales of the edge map. Recently, the VGG16-based RCF achieved state-of-the-art performance on the edge detection task. It modifies VGG16 as follows: (1) since the stride of the fifth pooling layer is 32, the resulting output plane is too small and the interpolated prediction map is too blurry for edge localization, so the fifth pooling layer and all fully connected layers of VGG16 are discarded; (2) after each convolution layer of VGG16, a convolution layer with kernel size 1×1 and channel depth 21 is attached to extract features at different scales; the multi-scale features of each stage are element-wise added and passed through a convolution layer with kernel size 1×1 and channel depth 1 to obtain composite features, which are then up-sampled with a deconvolution layer to serve as the edge output of each stage; the edge outputs of the stages are fused with a 1×1 convolution layer, and deep supervision is applied both to the edge output of each stage and to the fused edge output. The RCF model combines the rich features of all convolution layers and thus improves the accuracy of edge detection. A sketch of one such side-output branch is given below.
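As a concrete illustration of the side-output design described above, the sketch below shows one stage of an RCF-style branch (a 1×1 convolution with channel depth 21 after each convolution layer of the stage, element-wise fusion, a 1×1 convolution with channel depth 1, and deconvolution upsampling); the module name and the choice of transposed-convolution parameters are our assumptions:

```python
import torch
import torch.nn as nn

class SideOutput(nn.Module):
    """One RCF-style side-output branch for a stage with num_convs conv layers."""
    def __init__(self, in_channels, num_convs, upsample_stride):
        super().__init__()
        # A 1x1 convolution with channel depth 21 after every conv layer of the stage.
        self.reduce = nn.ModuleList(
            [nn.Conv2d(in_channels, 21, kernel_size=1) for _ in range(num_convs)])
        # 1x1 convolution with channel depth 1 applied to the element-wise sum.
        self.score = nn.Conv2d(21, 1, kernel_size=1)
        # Deconvolution (transposed convolution) back to the input resolution.
        self.upsample = (nn.Identity() if upsample_stride == 1 else
                         nn.ConvTranspose2d(1, 1, kernel_size=2 * upsample_stride,
                                            stride=upsample_stride,
                                            padding=upsample_stride // 2))

    def forward(self, stage_features):
        # stage_features: list of feature maps, one per conv layer of this stage.
        fused = sum(r(f) for r, f in zip(self.reduce, stage_features))  # element-wise addition
        return self.upsample(self.score(fused))                         # edge output of this stage
```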
Let (X, Y) denote one sample of our input training dataset T, where X = {x_i, i = 1, ···, |X|} is the original input image and Y = {y_i, i = 1, ···, |X|}, y_i ∈ {0, 1}, is the corresponding ground-truth edge map. The training loss of each image is given by formula (1):
where Y+ and Y− denote the ground-truth label sets of edge and non-edge pixels, respectively, λ is the parameter that automatically balances the loss between the positive and negative classes, and W denotes all network-layer parameters. The final loss can be obtained by further aggregating these generated edge maps, as shown in formula (2):
where X_j denotes the edge map output by stage j, and X_fuse denotes the edge map output by the final fusion layer.
Conventional convolutional neural networks have local receptive fields, so the generated feature representations are also local. Without explicitly using long-range context information, local features may cause differences between the features of pixels with the same label, leading to intra-class inconsistency and ultimately harming recognition performance. To address this problem, we study a self-attention mechanism that establishes associations between features. First, we capture global context information. The global features are then fed into the channel self-attention module. The self-attention module helps to adaptively combine local features with the corresponding global context, and can gradually filter out noise by emphasizing useful information. The overall architecture is shown in FIG. 2 and the attention module architecture in FIG. 3; we add a global-context channel self-attention module after the side outputs and before the fifth stage to fuse the context information.
4. Global channel self-attention module
First, the 1×1 convolution W_G and a softmax function produce the global attention weights, a global context attention map S is computed by attention pooling, and the global attention map is shared so that the network can obtain long-range global context information. We then recalibrate the channel response with the 1×1 convolutions W_C. Finally, we aggregate the global context features, weighted as defined by the attention map, onto the features at each location by addition. We use U = {u_n, n = 1, ···, N} to denote the input feature map, where N = H×W is the number of pixels in the feature map. Our global attention is given by formula (3):
S = Σ_n [ exp(W_g u_n) / Σ_m exp(W_g u_m) ] · u_n    (3)
where n lists all possible positions, exp(W_g u_n) is the embedded Gaussian function used to compute similarity in the embedded space, and Σ_m exp(W_g u_m) is the normalization factor.
To make the attention module lightweight, we use a bottleneck transformation to reduce the number of parameters from C·C to 2·C·C/r, where C is the number of channels, r is the bottleneck ratio, and C/r is the hidden representation dimension of the bottleneck. Since the two-layer bottleneck transformation increases the difficulty of optimization, a normalization layer is added in the bottleneck transformation before the ReLU layer to ease optimization; it also acts as a regularizer and thus benefits generalization, as shown in FIG. 4. We use Z = {z_n, n = 1, ···, N} to denote the output feature map of the attention module; the complete attention module is given by formula (4):
z_n = u_n + W_C2 ReLU(LN(W_C1 S))    (4)
5. Experimental datasets
To evaluate the proposed method, we performed experiments on the common datasets BSDS500 and NYUD.
The BSDS500 dataset, provided by the Berkeley computer vision group, can be used for image segmentation and object edge detection. The dataset contains 200 training samples, 100 validation samples and 200 test samples. Each ground truth is annotated by 4 to 9 people; we regard a pixel as a true edge if more than half of the annotators label it. We augment the training and validation sets of BSDS500 with rotation, flipping and scaling, generating 28800 training samples with the same data augmentation as HED. Inspired by previous work, we mix the augmented BSDS500 data with the flipped PASCAL-Context dataset to form a training set with 49006 samples.
The NYUD dataset consists of 1449 aligned RGB and depth images. In recent years this dataset has been used for the evaluation of edge detection tasks; we use only the RGB part. We split the NYUD dataset into 381 training samples, 414 validation samples and 654 test samples. Following RCF, we train our network on the training and validation sets, with data augmentation by random flipping, scaling and rotation. A sketch of this augmentation is given below.
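The augmentation just mentioned can be sketched as follows; the concrete angle and scale sets are assumptions, since the text only names the operations (rotation, flipping, scaling):

```python
import random
import torchvision.transforms.functional as TF

def augment(image, label, angles=(0, 90, 180, 270), scales=(0.5, 1.0, 1.5)):
    """Jointly rotate, flip and scale an image tensor and its edge label tensor."""
    angle = random.choice(angles)
    image, label = TF.rotate(image, angle), TF.rotate(label, angle)
    if random.random() < 0.5:                         # random horizontal flip
        image, label = TF.hflip(image), TF.hflip(label)
    scale = random.choice(scales)
    h, w = image.shape[-2:]
    new_size = [int(h * scale), int(w * scale)]
    # Nearest-neighbour resizing may be preferable for the binary label map.
    return TF.resize(image, new_size), TF.resize(label, new_size)
```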
Examples
We use PyTorch, well known in the art, to implement our network. Our network is initialized with a VGG16 pre-trained on ImageNet. The loss-balancing parameter λ is set to 1.1 for the BSDS500 dataset and 1.2 for the NYUD dataset.
The SGD optimizer randomly samples 10 images in each iteration; the global learning rate is set to 1e-6 and is divided by 10 after every 10K iterations. Momentum and weight decay are set to 0.9 and 0.0002, respectively. We train for a total of 40K iterations. All experiments of the present invention were performed on an NVIDIA 1080 GPU. The training schedule can be summarized by the sketch below.
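In the sketch below, model, train_iter and edge_loss are placeholders for the network, a data iterator yielding 10-image batches, and the per-map loss of formula (1); they are assumptions for illustration and not part of the patent:

```python
import torch

def train(model, train_iter, edge_loss, iters=40000):
    """SGD schedule from the text: lr 1e-6 divided by 10 every 10K iterations,
    momentum 0.9, weight decay 0.0002, 40K iterations in total."""
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-6,
                                momentum=0.9, weight_decay=0.0002)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10000, gamma=0.1)
    for _ in range(iters):
        image, label = next(train_iter)           # batch of 10 images per iteration
        side_outputs, fused = model(image)        # per-stage edge outputs and fused output
        loss = sum(edge_loss(p, label) for p in side_outputs) + edge_loss(fused, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()
```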
We evaluated edge detection performance under the common evaluation metrics: optimal dataset scale (ODS), optimal image scale (OIS) and average precision (AP). Before evaluation, we apply non-maximum suppression (NMS) to thin the edges, as in previous work. Following previous work, the localization tolerance, i.e. the maximum allowed distance between a predicted edge and the ground truth, is set to 0.0075 for the BSDS500 dataset. Since the images in the NYUD dataset are larger than those in the BSDS500 dataset, we increase the maximum allowed tolerance from 0.0075 to 0.011.
1.1 ablation study
To study and validate the impact of the parameters, we take the RCF network as the baseline network.
First, we test the effect of the attention-module parameter, the bottleneck ratio r, on the edge detection results on the BSDS500 dataset. The bottleneck design aims to reduce parameter redundancy and strike a balance between performance and parameters. We only add our attention module after the downsampling layer. In Table 1 we vary the bottleneck ratio r; as r decreases from 32 to 4, the number of parameters and FLOPs increases and performance improves continuously (0.6% ODS and 0.5% OIS). This demonstrates that our module effectively improves edge detection performance and achieves a good balance between performance and parameters. In the following experiments, we fix r = 16.
Table 2 shows a comparison between different stages, i.e., adding the attention module after different stages. The ODS and OIS scores increase by 0.7% and 0.5%, respectively. Adding the attention module after the fourth stage gives the best performance.
Table 1 edge detection performance for different bottleneck rates r on BSDS500 dataset
Table 2 Performance of adding the attention module after different stages on the BSDS500 dataset with r = 16
stage ODS OIS AP
baseline .798 .817 -
1,2,3,4 .799 .818 .815
2,3,4 .805 .822 .824
3,4 .805 .822 .830
4 .805 .822 .834
1.2 Comparison of performance with other work
On BSDS500 we compared our approach to several of the most advanced edge detection networks. The experimental results on the BSDS500 dataset are summarized in table 3 and fig. 5.
As shown by the results, compared with other networks using multi-scale features (HED, RCF and DeepBoundary), our network improves ODS by 1.7%, 0.9% and 0.6%, OIS by 1.4%, 1.0% and 0.7%, and AP by 0.6%, 2.6% and 0.5%, respectively. These results indicate that the global channel self-attention module improves the modeling of context correlation and thus the performance of edge detection. FIG. 6 compares the predictions of our method and RCF before non-maximum suppression (NMS). It can be observed that our method effectively removes most noise and blurred boundaries and produces cleaner, sharper image edges.
Table 3 Comparison with other methods on the BSDS500 dataset. + indicates training with the additional PASCAL-Context dataset
Methods ODS OIS AP
Human .803 .803 -
Canny .611 .676 .520
SE .743 .763 .800
OEF .746 .770 .820
DeepEdge .753 .769 .784
DeepContour .757 .776 .790
HFL .767 .788 .795
HED .788 .808 .840
CEDN + .788 .804 -
RDS .792 .810 .818
RCF .798 .815 -
RCF + .806 .824 .840
DeepBoundary .789 .811 .789
DeepBoundary + .809 .827 .861
Ours .805 .822 .834
Ours + .815 .834 .866
Performance on NYUD: Table 4 shows the quantitative results of our method compared with several recent methods, including gPb-UCM, gPb+NG, OEF, SE, SE+NG+, HED, RCF and LPCB, and the precision-recall (P-R) curves are shown in FIG. 7. The results in FIG. 7 are consistent with the experiments on the BSDS500 dataset. Our method achieves the best performance with an ODS score of 0.741, demonstrating its effectiveness.
Table 4 Comparison with other methods on the RGB portion of the NYUD dataset
Methods ODS OIS AP
gPb-UCM .631 .661 .562
gPb+NG .687 .716 .629
OEF .651 .667 -
SE .695 .708 .679
SE+NG+ .706 .734 .738
HED .717 .732 .734
RCF .729 .742 -
LPCB .739 .754 -
Ours .741 .759 .740

Claims (3)

1. A multi-scale edge detection method under deep supervision is characterized by comprising the following specific steps:
(1) Construct an edge detector comprising an improved VGG16 network and an attention module; the improved VGG16 network removes the fifth pooling layer and all fully connected layers of the original VGG16 and keeps the 13 convolution layers and the first four pooling layers; the attention module consists of a global module and a channel module, wherein the global module comprises a 1×1 convolution layer and a softmax layer, the channel module comprises a bottleneck structure, a normalization layer and a ReLU activation layer, the bottleneck structure comprises two fully connected layers, and each fully connected layer is a 1×1 convolution layer;
(2) Initialize the improved VGG16 network using a VGG16 pre-trained on ImageNet;
(3) Expand the dataset images using rotation, flipping and scaling; resize the images to 0.5, 1.0 and 1.5 times their original size to construct image pyramids, and sequentially input the image pyramids of each dataset into the edge detector;
(4) The improved VGG16 network performs the stage 1 to stage 4 convolution operations on the input dataset image; the attention module performs a 1×1 convolution on the output of stage 4, feeds the result into a softmax function to obtain a global context attention map, and shares the global context attention map with every channel of the stage-4 output features; one fully connected layer in the bottleneck structure reduces the channel dimension of the stage-4 output features fused with the global context attention, and LayerNorm normalizes the reduced global context attention; the normalized data of each channel are fed into a ReLU activation function, and the other fully connected layer in the bottleneck structure restores the reduced channel dimension, yielding features that fuse global information and recalibrate the inter-channel responses; these features are fed into the stage-5 convolution layers for the stage-5 convolution operations; the output of every convolution layer from stage 1 to stage 5 is then downsampled and multi-scale features are extracted to obtain a multi-scale feature map;
(5) The global module of the attention module performs a 1×1 convolution on the multi-scale features obtained in step (4), feeds the result into a softmax function to obtain a global context attention map, and shares the global context attention map with every channel of the multi-scale features;
(6) One fully connected layer in the bottleneck structure reduces the channel dimension of the multi-scale features fused with the global context attention, and LayerNorm normalizes the reduced global context attention; the normalized data of each channel are fed into a ReLU activation function, and the other fully connected layer in the bottleneck structure restores the reduced channel dimension, yielding features that fuse global information and recalibrate the inter-channel responses;
(7) The features obtained in step (6) are aggregated by addition onto each position of the multi-scale feature map of step (4) to obtain aggregated features;
(8) The aggregated features obtained in step (7) are element-wise added stage by stage using a convolution with kernel size 1×1 and channel depth 1 to obtain composite features;
(9) The composite features of step (8) are up-sampled by deconvolution to obtain the edge output of each stage, and the edge outputs are supervised using loss/sigmoid to optimize the edge detector parameters;
(10) The edge outputs of each stage in step (9) are fused using concatenation (concat) and a 1×1 convolution to obtain an edge prediction map;
(11) The edge prediction maps at the other sizes of the image pyramid are resized to the original image size by bilinear interpolation; the resized edge prediction maps are averaged to obtain the final prediction map; the edge detector parameters are continuously learned and optimized by supervising the edge prediction map with loss/sigmoid.
2. The method for multi-scale edge detection under deep supervision according to claim 1, wherein the loss function of loss/sigmoid is specifically as follows:
one sample of the input training dataset T is denoted by (X, Y), where X = {x_i, i = 1, ···, |X|} is the original input image and Y = {y_i, i = 1, ···, |X|}, y_i ∈ {0, 1}, is the corresponding ground-truth edge map; the training loss of each image is given by formula (1):
wherein Y+ and Y− denote the ground-truth label sets of edge and non-edge pixels, respectively, λ denotes the parameter that automatically balances the loss between the positive and negative classes, W denotes all network-layer parameters, P(y_i = 1 | X; W) denotes the probability, computed from the input X with parameters W, that pixel i is an edge when its true value y_i is 1, and P(y_i = 0 | X; W) denotes the probability, computed from the input X with parameters W, that pixel i is a non-edge when its true value y_i is 0;
the final loss is obtained by further aggregating the edge maps formed by the edge outputs of each stage in step (9), as shown in formula (2):
wherein X_j denotes the edge map output by stage j, and X_fuse denotes the edge map output by the final fusion layer.
3. The method for multi-scale edge detection under deep supervision according to claim 1 or 2, wherein the functions of the attention module are as follows: first, the 1×1 convolution W_G of the global module and a softmax function produce the global attention weights, a global context attention map S is computed by attention pooling, and the global context attention map S is shared so that the edge detector can obtain long-range global context information; then the channel response is recalibrated by the two 1×1 convolutions W_C in the bottleneck structure; finally, the global context features are weighted and aggregated by addition onto the features at each position;
let U = {u_n, n = 1, ···, N} denote the multi-scale feature map input to the attention module, where N = H×W is the number of pixels in the feature map; the global context attention is given by formula (3):
S = Σ_n [ exp(W_g u_n) / Σ_m exp(W_g u_m) ] · u_n    (3)
where n lists all possible positions, exp(W_g u_n) is the embedded Gaussian function used to compute similarity in the embedded space, Σ_m exp(W_g u_m) is the normalization factor, W_g denotes the 1×1 convolution W_G, and m is a variable that also lists all possible positions;
the bottleneck structure reduces the number of parameters from C·C to 2·C·C/r, where C is the number of channels, r is the bottleneck ratio, and C/r is the hidden representation dimension of the bottleneck; a normalization layer is added to the bottleneck transformation before the ReLU layer; let Z = {z_n, n = 1, ···, N} denote the output feature map of the attention module, and the complete attention module is given by formula (4):
z_n = u_n + W_C2 ReLU(LN(W_C1 S))    (4);
wherein W_C2 denotes the convolution operation of W_C2, and LN(W_C1 S) denotes applying the convolution W_C1 to the attention map S followed by layer normalization LN.
CN202011445466.4A 2020-12-11 2020-12-11 Multi-scale edge detection method under deep supervision Active CN112580661B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011445466.4A CN112580661B (en) 2020-12-11 2020-12-11 Multi-scale edge detection method under deep supervision

Publications (2)

Publication Number Publication Date
CN112580661A CN112580661A (en) 2021-03-30
CN112580661B true CN112580661B (en) 2024-03-08

Family

ID=75130942

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011445466.4A Active CN112580661B (en) 2020-12-11 2020-12-11 Multi-scale edge detection method under deep supervision

Country Status (1)

Country Link
CN (1) CN112580661B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113344005B (en) * 2021-05-12 2022-04-15 武汉大学 Image edge detection method based on optimized small-scale features
CN113469199A (en) * 2021-07-15 2021-10-01 中国人民解放军国防科技大学 Rapid and efficient image edge detection method based on deep learning
CN115019022B (en) * 2022-05-30 2024-04-30 电子科技大学 Contour detection method based on double-depth fusion network
CN116400490B (en) * 2023-06-08 2023-08-25 杭州华得森生物技术有限公司 Fluorescence microscopic imaging system and method thereof

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110009679A (en) * 2019-02-28 2019-07-12 江南大学 A kind of object localization method based on Analysis On Multi-scale Features convolutional neural networks
CN110706242A (en) * 2019-08-26 2020-01-17 浙江工业大学 Object-level edge detection method based on depth residual error network
CN110648316A (en) * 2019-09-07 2020-01-03 创新奇智(成都)科技有限公司 Steel coil end face edge detection algorithm based on deep learning
CN111462126A (en) * 2020-04-08 2020-07-28 武汉大学 Semantic image segmentation method and system based on edge enhancement

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Action recognition algorithm based on an edge-aware learning network under a complex surveillance background; 聂玮; 曹悦; 朱冬雪; 朱艺璇; 黄林毅; 计算机应用与软件 (Computer Applications and Software); 2020-08-12 (No. 08); full text *
Thangka image edge detection algorithm combining morphology and RCF; 刘千; 葛阿雷; 史伟; 计算机应用与软件 (Computer Applications and Software); 2019-06-12 (No. 06); full text *

Also Published As

Publication number Publication date
CN112580661A (en) 2021-03-30

Similar Documents

Publication Publication Date Title
CN112580661B (en) Multi-scale edge detection method under deep supervision
CN113313657B (en) Unsupervised learning method and system for low-illumination image enhancement
CN111209952B (en) Underwater target detection method based on improved SSD and migration learning
CN112507997B (en) Face super-resolution system based on multi-scale convolution and receptive field feature fusion
CN111950649B (en) Attention mechanism and capsule network-based low-illumination image classification method
CN113052210A (en) Fast low-illumination target detection method based on convolutional neural network
Sun et al. Robust retinal vessel segmentation from a data augmentation perspective
CN109034184B (en) Grading ring detection and identification method based on deep learning
CN112614077A (en) Unsupervised low-illumination image enhancement method based on generation countermeasure network
CN113780132B (en) Lane line detection method based on convolutional neural network
CN115063373A (en) Social network image tampering positioning method based on multi-scale feature intelligent perception
CN111582092B (en) Pedestrian abnormal behavior detection method based on human skeleton
CN112150493A (en) Semantic guidance-based screen area detection method in natural scene
CN110503140B (en) Deep migration learning and neighborhood noise reduction based classification method
CN113393457B (en) Anchor-frame-free target detection method combining residual error dense block and position attention
CN111667019B (en) Hyperspectral image classification method based on deformable separation convolution
CN110807742A (en) Low-light-level image enhancement method based on integrated network
CN113392711A (en) Smoke semantic segmentation method and system based on high-level semantics and noise suppression
CN114419413A (en) Method for constructing sensing field self-adaptive transformer substation insulator defect detection neural network
CN116071676A (en) Infrared small target detection method based on attention-directed pyramid fusion
CN112488958A (en) Image contrast enhancement method based on scale space
Guo et al. Multifeature extracting CNN with concatenation for image denoising
CN113361466B (en) Multispectral target detection method based on multi-mode cross guidance learning
CN112488220B (en) Small target detection method based on deep learning
CN117392375A (en) Target detection algorithm for tiny objects

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant