CN112580661B - Multi-scale edge detection method under deep supervision - Google Patents

Multi-scale edge detection method under deep supervision

Info

Publication number
CN112580661B
CN112580661B (application CN202011445466.4A)
Authority
CN
China
Prior art keywords
edge
attention
global
convolution
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011445466.4A
Other languages
Chinese (zh)
Other versions
CN112580661A (en)
Inventor
孙俊
张旺
吴豪
吴小俊
方伟
陈祺东
李超
游琪
冒钟杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangnan University
Original Assignee
Jiangnan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangnan University
Priority to CN202011445466.4A
Publication of CN112580661A
Application granted
Publication of CN112580661B
Legal status: Active (current)
Anticipated expiration legal status

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
          • G06V 10/00 Arrangements for image or video recognition or understanding
            • G06V 10/40 Extraction of image or video features
              • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
        • G06F ELECTRIC DIGITAL DATA PROCESSING
          • G06F 18/00 Pattern recognition
            • G06F 18/20 Analysing
              • G06F 18/24 Classification techniques
                • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
                  • G06F 18/2415 Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
        • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00 Computing arrangements based on biological models
            • G06N 3/02 Neural networks
              • G06N 3/04 Architecture, e.g. interconnection topology
                • G06N 3/045 Combinations of networks
                • G06N 3/047 Probabilistic or stochastic networks
                • G06N 3/048 Activation functions
              • G06N 3/08 Learning methods
        • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
          • G06T 3/00 Geometric image transformations in the plane of the image
            • G06T 3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
              • G06T 3/4007 Scaling of whole images or parts thereof based on interpolation, e.g. bilinear interpolation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

A multi-scale edge detection method under deep supervision. The method combines local features with their corresponding global correlations, adaptively recalibrates channel responses, directs the network to ignore irrelevant information, and emphasizes correlations between related features. The effectiveness of the multi-scale, deeply supervised self-attention module algorithm was demonstrated by a series of ablation experiments on the BSDS500 and NYUD datasets. Compared with other state-of-the-art edge detection networks, the algorithm performs better and improves prediction accuracy with fewer parameters, achieving an ODS score of 0.815 on the BSDS500 dataset, 0.9% higher than prior algorithms.

Description

Multi-scale edge detection method under deep supervision
Technical Field
The invention belongs to the field of edge detection, and particularly relates to a multi-scale edge detection method under deep supervision.
Background
Edge detection aims at extracting object boundaries and visually salient edges from natural images, which are important for high-level computer vision tasks such as image segmentation and object detection/recognition. As a basis for these high-level tasks, edge detection has a rich history; we focus here on several representative works of proven significance. Early conventional approaches include the Sobel detector, zero-crossing detection, and the widely used Canny detector. Pb, gPb, Sketch Tokens, and Structured Edges use sophisticated learning paradigms to distinguish edge pixels based on manual features (e.g., brightness, color, gradient, and texture). However, it is difficult to represent semantic meaning using such low-level visual cues.
The edges of an image are made up of meaningful local detail and object-level boundaries. Since CNNs have a strong ability to automatically learn high-level features of natural images, they have been applied to edge detection with good results, e.g., N4-Fields, DeepContour, DeepEdge and CSCNN. To obtain diverse edge scales, the CNN-based HED and RCF supervise the predictions of different network layers with ground-truth edge maps: lower layers detect more local detail, while higher layers capture object-level boundaries with larger receptive fields. HED indicates that, at high recall, deep supervision can compromise low-level prediction while facilitating the learning of global object boundaries. Rich convolution features are very effective for many visual tasks, but HED and RCF still do not explicitly use global context information in the training and prediction strategies of the side outputs, nor do they directly impose constraints on neighboring pixel labels to strengthen deep supervision. We can therefore improve the quality of the network representation by explicitly modeling channel correlation: the network can adaptively recalibrate channel responses and learn to use global information to emphasize useful features and suppress less useful ones.
As shown in FIG. 1, as the receptive field becomes larger, the edges captured by the different convolution layers become progressively coarser and lose much useful detail. The purpose of capturing long-range correlations is to extract a global understanding of the visual scene, which has proven useful for a wide range of recognition tasks such as image/video classification, object detection and segmentation, and which RCF also requires. In CNNs, long-range correlation is mainly modeled by deeply stacking convolution layers, since a convolution layer only establishes pixel relationships in a local neighborhood. However, simply repeating convolution layers is computationally inefficient, difficult to optimize, and makes it hard to pass information between distant locations, resulting in ineffective modeling of long-range correlations. To solve this problem, we model the global context to form an attention map and then aggregate the features of all locations with weights defined by the attention map. Finally, the aggregated features and the features at each location are added to form new features.
Disclosure of Invention
The invention aims to provide a multi-scale edge detection method under deep supervision, which solves the problems existing in the prior art.
The technical scheme of the invention is as follows:
a multi-scale edge detection method under depth supervision comprises the following specific steps:
(1) The constructed edge detector comprises an improved VGG16 network and an attention module; the improved VGG16 network removes the fifth pooling layer and all full connection layers of the original VGG16, and keeps 13 convolution layers and the first four pooling layers; the attention module is composed of a global module and a channel module, wherein the global module comprises a 1 multiplied by 1 convolution layer and a softmax function layer, the channel module comprises a bottleneck structure, a normalization layer and a Relu activation layer, the bottleneck structure comprises two full connection layers, and each full connection layer is a 1 multiplied by 1 convolution layer;
(2) Initializing the improved VGG16 network using the VGG16 pre-trained on ImageNet;
(3) Expanding the images of the data sets by using rotation, flipping and scaling, adjusting the sizes of the images by 0.5, 1.0 and 1.5 times to construct image pyramids, and sequentially inputting the image pyramids of each data set into an edge detector;
(4) The improved VGG16 network carries out phase 1 to phase 4 convolution operation on the input data set image, the attention module carries out 1X 1 convolution operation on the output of the 4 th phase, the operation result is input into a softmax function to obtain a global context attention figure, and the global context attention figure is shared with each channel of the output characteristic of the 4 th phase; the dimension of the output characteristic channel of the 4 th stage fused with the global context attention is reduced by using one full connection layer in the bottleneck structure, and the dimension-reduced global context attention is normalized by using LayerNorm; inputting the normalized data of each channel into a ReLU activation function, and then increasing the channel dimension to be reduced by another full-connection layer in the bottleneck structure to obtain the characteristics which are fused with global characteristics and adjust the response among channels; inputting the obtained features which are fused with the global features and adjust the response among channels into a convolution layer of the stage 5, and carrying out the stage 5 convolution operation; then downsampling the output of each convolution layer from the stage 1 to the stage 5, and extracting multi-scale features to obtain a multi-scale feature map;
(5) The global module of the attention module carries out 1X 1 convolution operation on the multi-scale feature obtained in the step (4), inputs the operation result into a softmax function to obtain a global context attention pattern, and shares the global context attention pattern with each channel of the multi-scale feature;
(6) Using one full connection layer in the bottleneck structure to reduce the dimension of the multi-scale feature channel fused with the global context attention, and using LayerNorm to normalize the reduced dimension global context attention; inputting the normalized data of each channel into a ReLU activation function, and then increasing the channel dimension to be reduced by another full-connection layer in the bottleneck structure to obtain the characteristics which are fused with global characteristics and adjust the response among channels;
(7) Aggregating the features obtained in the step (6) to each position of the multi-scale feature map in the step (4) through addition to obtain aggregated features;
(8) Performing element addition on the aggregate characteristics obtained in the step (7) according to stages by using a convolution with a kernel size of 1 multiplied by 1 and a channel depth of 1 to obtain composite characteristics;
(9) Up-sampling the composite features in the step (8) by using deconvolution to obtain edge output of each stage, and monitoring the edge output by using loss/sigmoid to optimize the parameters of the edge detector;
(10) Fusing the edge outputs of each stage in the step (9) by using a concat function and 1 multiplied by 1 convolution to obtain an edge prediction graph;
(11) Adjusting edge prediction graphs of other sizes in the image pyramid to the original image size by using bilinear interpolation; averaging the edge prediction graphs with the adjusted sizes to obtain a final prediction graph; the edge detector parameters are constantly learned and optimized by using a loss/sigmoid supervised edge prediction graph.
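The multi-scale testing strategy of steps (3) and (11) can be sketched in PyTorch (the framework used in the embodiments) roughly as follows; the function and variable names are our own, and "model" stands for the trained edge detector of steps (1) to (10), so this is an illustrative sketch rather than the exact implementation:

```python
import torch
import torch.nn.functional as F

def multi_scale_predict(model, image, scales=(0.5, 1.0, 1.5)):
    """Run the edge detector on an image pyramid and average the results.

    model : assumed to map a (1, 3, H, W) tensor to a (1, 1, H, W) edge
            probability map at the input resolution.
    image : a (1, 3, H, W) tensor.
    """
    _, _, h, w = image.shape
    fused = image.new_zeros((1, 1, h, w))
    for s in scales:
        # Build one level of the image pyramid (step (3)).
        scaled = F.interpolate(image, scale_factor=s, mode='bilinear',
                               align_corners=False)
        with torch.no_grad():
            pred = model(scaled)                     # edge prediction map at this scale
        # Resize the prediction back to the original size with bilinear
        # interpolation (step (11)).
        pred = F.interpolate(pred, size=(h, w), mode='bilinear',
                             align_corners=False)
        fused += pred
    return fused / len(scales)                       # average over the pyramid levels
```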
The loss function of loss/sigmoid is specifically as follows:
One sample of the input training dataset T is denoted by (X, Y), where X = {x_i, i = 1, ···, |X|} is the original input image and Y = {y_i, i = 1, ···, |X|}, y_i ∈ {0, 1}, is the corresponding ground-truth edge map. The training loss of each image is given by formula (1):
where Y+ and Y− denote the ground-truth label sets of edge and non-edge pixels, respectively, λ is the parameter that automatically balances the loss between the positive and negative classes, W denotes all network-layer parameters, P(y_i = 1 | X; W) is the probability, computed from the input X with parameters W, that pixel i is an edge when its true value y_i is 1, and P(y_i = 0 | X; W) is the probability, computed from the input X with parameters W, that pixel i is a non-edge when its true value y_i is 0.
The final loss is obtained by further aggregating the edge maps formed by the edge outputs of each stage in step (9), as shown in formula (2):
where X_j denotes the edge map output by stage j, and X_fuse denotes the edge map output by the final fusion layer.
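As an illustration of the loss described above, the following PyTorch sketch implements an HED/RCF-style class-balanced cross-entropy of the kind formula (1) refers to; the exact placement of the balancing weights is our assumption based on the surrounding description, not a verbatim reproduction of the patent formula:

```python
import torch

def edge_loss(pred, label, lam=1.1):
    """Class-balanced sigmoid cross-entropy for one edge map (cf. formula (1)).

    pred  : (N, 1, H, W) tensor of edge probabilities P(y_i = 1 | X; W)
    label : (N, 1, H, W) ground-truth edge map with values in {0, 1}
    lam   : balancing parameter lambda (1.1 for BSDS500, 1.2 for NYUD)
    """
    pos = (label == 1).float()                    # Y+ : edge pixels
    neg = (label == 0).float()                    # Y- : non-edge pixels
    num_pos, num_neg = pos.sum(), neg.sum()
    alpha = lam * num_pos / (num_pos + num_neg)   # weight for the negative term
    beta = num_neg / (num_pos + num_neg)          # weight for the positive term
    eps = 1e-6
    loss = -(beta * pos * torch.log(pred + eps)
             + alpha * neg * torch.log(1.0 - pred + eps)).sum()
    return loss
```

Following formula (2), the final loss would then aggregate the per-stage edge maps and the fused edge map, e.g. sum(edge_loss(p, y) for p in side_outputs) + edge_loss(fused, y).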
The functions of the attention module are specifically as follows:
First, the 1×1 convolution W_G of the global module and a softmax function produce the global attention weights, a global context attention map S is computed by attention pooling, and the global context attention map S is shared so that the edge detector can obtain long-range global context information. Then the channel response is recalibrated by the two 1×1 convolutions W_C in the bottleneck structure. Finally, the global context features are weighted and aggregated by addition onto the features at each location.
Let U = {u_n, n = 1, ···, N} denote the multi-scale feature map input to the attention module, where N = H×W is the number of pixels in the feature map. The global context attention is given by formula (3):
S = Σ_n [ exp(W_g u_n) / Σ_m exp(W_g u_m) ] · u_n    (3)
where n lists all possible positions, exp(W_g u_n) is the embedded Gaussian function used to compute similarity in the embedded space, Σ_m exp(W_g u_m) is the normalization factor, W_g denotes the 1×1 convolution W_G, and m is a variable that also lists all possible positions.
The bottleneck structure reduces the number of parameters from C·C to 2·C·C/r, where C is the number of channels, r is the bottleneck ratio, and C/r is the hidden representation dimension of the bottleneck. A normalization layer is added to the bottleneck transform before the ReLU layer. Let Z = {z_n, n = 1, ···, N} denote the output feature map of the attention module; the complete attention module is given by formula (4):
z_n = u_n + W_C2 ReLU(LN(W_C1 S))    (4);
where W_C2 denotes the convolution operation of W_C2, and LN(W_C1 S) denotes applying the convolution W_C1 to the attention map S followed by layer normalization LN.
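A minimal PyTorch sketch of the attention module described by formulas (3) and (4) is given below; it assumes the GCNet-style simplified non-local context modeling that the text describes, and the class and variable names are our own:

```python
import torch
import torch.nn as nn

class GlobalChannelAttention(nn.Module):
    """Global context pooling plus channel recalibration (formulas (3) and (4)).

    channels : number of input channels C
    r        : bottleneck ratio (r = 16 in the ablation study)
    """
    def __init__(self, channels, r=16):
        super().__init__()
        self.w_g = nn.Conv2d(channels, 1, kernel_size=1)       # W_G: global attention weights
        self.softmax = nn.Softmax(dim=2)
        self.bottleneck = nn.Sequential(                        # W_C1 -> LN -> ReLU -> W_C2
            nn.Conv2d(channels, channels // r, kernel_size=1),  # W_C1: reduce C to C/r
            nn.LayerNorm([channels // r, 1, 1]),                # normalization before ReLU
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, kernel_size=1),  # W_C2: restore C/r to C
        )

    def forward(self, u):
        b, c, h, w = u.shape
        # Attention pooling over all N = H*W positions (formula (3)).
        weights = self.softmax(self.w_g(u).view(b, 1, h * w))          # (b, 1, N)
        s = torch.bmm(u.view(b, c, h * w), weights.transpose(1, 2))    # (b, c, 1)
        s = s.view(b, c, 1, 1)                                         # global context map S
        # Channel recalibration and additive fusion (formula (4)):
        # z_n = u_n + W_C2 ReLU(LN(W_C1 S)), broadcast over all positions.
        return u + self.bottleneck(s)
```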
The invention has the following beneficial effects: the invention introduces a deeply supervised attention structure to accomplish the edge detection task. The method combines global information from different layers with the self-attention module to effectively model long-range correlations. Finally, noise regions are filtered out by dynamically recalibrating the channel features, which helps the network focus on the relevant regions of the image. Comparisons with more than 10 edge detection methods on the BSDS500 and NYUD datasets show that the method provides accurate and reliable edge detection.
Drawings
FIG. 1 shows the side outputs of the RCF stages. From left to right: the original image from the BSDS500 dataset and the side outputs of stage 1, stage 2, stage 3, stage 4 and stage 5.
FIG. 2 is an architecture of a multi-scale feature edge detection network under deep supervision.
FIG. 3 is a global channel self-attention module.
FIG. 4 is a diagram of a global channel self-attention module architecture.
FIG. 5 is a P-R curve on BSDS500 for our method and other methods.
FIG. 6 is an edge map comparison before NMS on the BSDS500 dataset. The first row is the original image, the second row is the ground-truth edge map, the third row is the RCF prediction, and the fourth row is the result of the method of the invention.
FIG. 7 is a P-R curve on NYUD for our method and other work.
Detailed Description
The technical scheme of the invention is further described below according to the attached drawings and the embodiments.
1. Edge detection
Edge detection is one of the most basic and challenging problems in computer vision. After decades of research, a large body of work has accumulated. We review only a portion of the representative related work in this section.
These methods can be broadly divided into three categories: traditional edge operators, learning-based methods, and, more recently, deep-learning-based methods. Traditional edge operators detect edges by detecting abrupt changes in luminance, color, and texture. The Sobel operator thresholds the image gradient to obtain edges. Canny extracts edges from the Gaussian-smoothed image using a double-threshold method; the Canny algorithm is still popular in various tasks due to its efficiency and robustness to noise. However, the accuracy of these early methods hardly meets today's high demands on detail. Learning-based methods use manual features to identify edges. Martin et al. train a classifier that combines texture gradient features. Arbeláez et al. integrate local cues into a global framework. Lim et al. map local patches to Sketch Tokens using a random forest to form local edges. Dollár and Zitnick propose Structured Edges, supervised with multi-scale responses, which can learn clustering and mapping simultaneously and directly output local edge patches. However, methods based on manual features cannot efficiently express the high-level, semantically meaningful information of edges. In recent years, automatic extraction of deep features with deep learning has achieved state-of-the-art results. Shen et al. use shape information to learn deep features that fit each subclass. Bertasius et al. use a CNN to generate features of candidate contour points. Xie and Tu propose an end-to-end model that deeply supervises the side-output features of different scales, achieving excellent performance (less than 2% below human level). On this basis, Kokkinos adjusts the loss function, adds training samples, and computes it globally. Liu et al. attach side outputs to all convolution layers of VGG16 and further add more features of different scales to improve the effect; their results exceed human performance on the BSDS500 dataset. Our approach is based on RCF; the above training strategies do not explicitly use context information, nor do they directly impose constraints on neighboring pixel labels, so we use global features to enhance the context modeling of the multi-scale side outputs.
2. Deep attention
The attention mechanism aims at emphasizing important regions, filtering out irrelevant information, and improving the modeling of long-range correlations. Recently, self-attention mechanisms have been successfully applied to various visual tasks such as visual question answering, classification, and detection. The mechanism embeds the independent response of each location into a space and weight-averages it to establish the relationship between local features and their corresponding global context. PSANet adaptively links each location in the feature map with the other locations, enabling aggregation of long-range context information. SENet and GENet re-weight different channels to recalibrate channel correlation according to the global context. However, the rescaling style of feature fusion is ineffective for global context modeling. The present invention adopts additive fusion to model the global context more effectively.
3. Summary of the method
The VGG16 network consists of 13 convolution layers, 3 fully connected layers and 5 pooling layers; it is deep, dense and multi-stage, and can efficiently generate acceptable multi-scale features to capture the inherent scales of the edge map. Recently, the VGG16-based RCF achieved state-of-the-art performance on the edge detection task. It modifies VGG16 as follows: (1) since the stride of the fifth pooling layer is 32, the resulting output plane is too small and the interpolated prediction map is too blurry for edge localization, so the fifth pooling layer and all fully connected layers of VGG16 are discarded; (2) after each convolution layer of VGG16, a convolution layer with kernel size 1×1 and channel depth 21 is attached to extract features at different scales; the multi-scale features of each stage are element-wise added and passed through a convolution layer with kernel size 1×1 and channel depth 1 to obtain composite features, which are then up-sampled with a deconvolution layer to serve as the edge output of each stage; the edge outputs of the stages are fused with a 1×1 convolution layer, and deep supervision is applied both to the edge output of each stage and to the fused edge output. The RCF model combines the rich features of all convolution layers and thus improves the accuracy of edge detection. A sketch of one such side-output branch is given below.
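As a concrete illustration of the side-output design described above, the sketch below shows one stage of an RCF-style branch (a 1×1 convolution with channel depth 21 after each convolution layer of the stage, element-wise fusion, a 1×1 convolution with channel depth 1, and deconvolution upsampling); the module name and the choice of transposed-convolution parameters are our assumptions:

```python
import torch
import torch.nn as nn

class SideOutput(nn.Module):
    """One RCF-style side-output branch for a stage with num_convs conv layers."""
    def __init__(self, in_channels, num_convs, upsample_stride):
        super().__init__()
        # A 1x1 convolution with channel depth 21 after every conv layer of the stage.
        self.reduce = nn.ModuleList(
            [nn.Conv2d(in_channels, 21, kernel_size=1) for _ in range(num_convs)])
        # 1x1 convolution with channel depth 1 applied to the element-wise sum.
        self.score = nn.Conv2d(21, 1, kernel_size=1)
        # Deconvolution (transposed convolution) back to the input resolution.
        self.upsample = (nn.Identity() if upsample_stride == 1 else
                         nn.ConvTranspose2d(1, 1, kernel_size=2 * upsample_stride,
                                            stride=upsample_stride,
                                            padding=upsample_stride // 2))

    def forward(self, stage_features):
        # stage_features: list of feature maps, one per conv layer of this stage.
        fused = sum(r(f) for r, f in zip(self.reduce, stage_features))  # element-wise addition
        return self.upsample(self.score(fused))                         # edge output of this stage
```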
Let (X, Y) denote one sample of our input training dataset T, where X = {x_i, i = 1, ···, |X|} is the original input image and Y = {y_i, i = 1, ···, |X|}, y_i ∈ {0, 1}, is the corresponding ground-truth edge map. The training loss of each image is given by formula (1):
where Y+ and Y− denote the ground-truth label sets of edge and non-edge pixels, respectively, λ is the parameter that automatically balances the loss between the positive and negative classes, and W denotes all network-layer parameters. The final loss can be obtained by further aggregating these generated edge maps, as shown in formula (2):
where X_j denotes the edge map output by stage j, and X_fuse denotes the edge map output by the final fusion layer.
Conventional convolutional neural networks have local receptive fields, so the generated feature representations are also local. Without explicitly using long-range context information, local features may cause differences between the features of pixels with the same label, leading to intra-class inconsistency and ultimately harming recognition performance. To address this problem, we study a self-attention mechanism that establishes associations between features. First, we capture global context information. The global features are then fed into the channel self-attention module. The self-attention module helps to adaptively combine local features with the corresponding global context, and can gradually filter out noise by emphasizing useful information. The overall architecture is shown in FIG. 2 and the attention module architecture in FIG. 3; we add a global-context channel self-attention module after the side outputs and before the fifth stage to fuse the context information.
4. Global channel self-attention module
First, the 1×1 convolution W_G and a softmax function produce the global attention weights, a global context attention map S is computed by attention pooling, and the global attention map is shared so that the network can obtain long-range global context information. We then recalibrate the channel response with the 1×1 convolutions W_C. Finally, we aggregate the global context features, weighted as defined by the attention map, onto the features at each location by addition. We use U = {u_n, n = 1, ···, N} to denote the input feature map, where N = H×W is the number of pixels in the feature map. Our global attention is given by formula (3):
S = Σ_n [ exp(W_g u_n) / Σ_m exp(W_g u_m) ] · u_n    (3)
where n lists all possible positions, exp(W_g u_n) is the embedded Gaussian function used to compute similarity in the embedded space, and Σ_m exp(W_g u_m) is the normalization factor.
To make the attention module lightweight, we use a bottleneck transformation to reduce the number of parameters from C·C to 2·C·C/r, where C is the number of channels, r is the bottleneck ratio, and C/r is the hidden representation dimension of the bottleneck. Since the two-layer bottleneck transformation increases the difficulty of optimization, a normalization layer is added in the bottleneck transformation before the ReLU layer to ease optimization; it also acts as a regularizer and thus benefits generalization, as shown in FIG. 4. We use Z = {z_n, n = 1, ···, N} to denote the output feature map of the attention module; the complete attention module is given by formula (4):
z_n = u_n + W_C2 ReLU(LN(W_C1 S))    (4)
5. Experimental datasets
To evaluate the proposed method, we performed experiments on the common datasets BSDS500 and NYUD.
The BSDS500 dataset, provided by the Berkeley computer vision group, can be used for image segmentation and object edge detection. The dataset contains 200 training samples, 100 validation samples and 200 test samples. Each ground truth is annotated by 4 to 9 people; we regard a pixel as a true edge if more than half of the annotators label it. We augment the training and validation sets of BSDS500 with rotation, flipping and scaling, generating 28800 training samples with the same data augmentation as HED. Inspired by previous work, we mix the augmented BSDS500 data with the flipped PASCAL-Context dataset to form a training set with 49006 samples.
The NYUD dataset consists of 1449 aligned RGB and depth images. In recent years this dataset has been used for the evaluation of edge detection tasks; we use only the RGB part. We split the NYUD dataset into 381 training samples, 414 validation samples and 654 test samples. Following RCF, we train our network on the training and validation sets, with data augmentation by random flipping, scaling and rotation. A sketch of this augmentation is given below.
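The augmentation just mentioned can be sketched as follows; the concrete angle and scale sets are assumptions, since the text only names the operations (rotation, flipping, scaling):

```python
import random
import torchvision.transforms.functional as TF

def augment(image, label, angles=(0, 90, 180, 270), scales=(0.5, 1.0, 1.5)):
    """Jointly rotate, flip and scale an image tensor and its edge label tensor."""
    angle = random.choice(angles)
    image, label = TF.rotate(image, angle), TF.rotate(label, angle)
    if random.random() < 0.5:                         # random horizontal flip
        image, label = TF.hflip(image), TF.hflip(label)
    scale = random.choice(scales)
    h, w = image.shape[-2:]
    new_size = [int(h * scale), int(w * scale)]
    # Nearest-neighbour resizing may be preferable for the binary label map.
    return TF.resize(image, new_size), TF.resize(label, new_size)
```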
Examples
We use PyTorch, well known in the art, to implement our network. Our network is initialized with a VGG16 pre-trained on ImageNet. The loss-balancing parameter λ is set to 1.1 for the BSDS500 dataset and 1.2 for the NYUD dataset.
The SGD optimizer randomly samples 10 images in each iteration; the global learning rate is set to 1e-6 and is divided by 10 after every 10K iterations. Momentum and weight decay are set to 0.9 and 0.0002, respectively. We train for a total of 40K iterations. All experiments of the present invention were performed on an NVIDIA 1080 GPU. The training schedule can be summarized by the sketch below.
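In the sketch below, model, train_iter and edge_loss are placeholders for the network, a data iterator yielding 10-image batches, and the per-map loss of formula (1); they are assumptions for illustration and not part of the patent:

```python
import torch

def train(model, train_iter, edge_loss, iters=40000):
    """SGD schedule from the text: lr 1e-6 divided by 10 every 10K iterations,
    momentum 0.9, weight decay 0.0002, 40K iterations in total."""
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-6,
                                momentum=0.9, weight_decay=0.0002)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10000, gamma=0.1)
    for _ in range(iters):
        image, label = next(train_iter)           # batch of 10 images per iteration
        side_outputs, fused = model(image)        # per-stage edge outputs and fused output
        loss = sum(edge_loss(p, label) for p in side_outputs) + edge_loss(fused, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()
```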
We evaluated edge detection performance under the common evaluation metrics: optimal dataset scale (ODS), optimal image scale (OIS) and average precision (AP). Before evaluation, we apply non-maximum suppression (NMS) to thin the edges, as in previous work. Following previous work, the localization tolerance, i.e. the maximum allowed distance between a predicted edge and the ground truth, is set to 0.0075 for the BSDS500 dataset. Since the images in the NYUD dataset are larger than those in the BSDS500 dataset, we increase the maximum allowed tolerance from 0.0075 to 0.011.
1.1 ablation study
To study and validate the impact of the parameters, we take the RCF network as the baseline network.
First, we test the effect of the attention-module parameter, the bottleneck ratio r, on the edge detection results on the BSDS500 dataset. The bottleneck design aims to reduce parameter redundancy and strike a balance between performance and parameters. We only add our attention module after the downsampling layer. In Table 1 we vary the bottleneck ratio r; as r decreases from 32 to 4, the number of parameters and FLOPs increases and performance improves continuously (0.6% ODS and 0.5% OIS). This demonstrates that our module effectively improves edge detection performance and achieves a good balance between performance and parameters. In the following experiments, we fix r = 16.
Table 2 shows a comparison between different stages, i.e., adding the attention module after different stages. The ODS and OIS scores increase by 0.7% and 0.5%, respectively. Adding the attention module after the fourth stage gives the best performance.
Table 1 edge detection performance for different bottleneck rates r on BSDS500 dataset
Table 2 Performance of adding the attention module after different stages on the BSDS500 dataset with r = 16
stage ODS OIS AP
baseline .798 .817 -
1,2,3,4 .799 .818 .815
2,3,4 .805 .822 .824
3,4 .805 .822 .830
4 .805 .822 .834
1.2 Comparison of performance with other work
On BSDS500 we compared our approach to several of the most advanced edge detection networks. The experimental results on the BSDS500 dataset are summarized in table 3 and fig. 5.
As shown by the results, compared with other networks using multi-scale features (HED, RCF and DeepBoundary), our network improves ODS by 1.7%, 0.9% and 0.6%, OIS by 1.4%, 1.0% and 0.7%, and AP by 0.6%, 2.6% and 0.5%, respectively. These results indicate that the global channel self-attention module improves the modeling of context correlation and thus the performance of edge detection. FIG. 6 compares the predictions of our method and RCF before non-maximum suppression (NMS). It can be observed that our method effectively removes most noise and blurred boundaries and produces cleaner, sharper image edges.
Table 3 Comparison with other methods on the BSDS500 dataset. + indicates training with the additional PASCAL-Context dataset
Methods ODS OIS AP
Human .803 .803 -
Canny .611 .676 .520
SE .743 .763 .800
OEF .746 .770 .820
DeepEdge .753 .769 .784
DeepContour .757 .776 .790
HFL .767 .788 .795
HED .788 .808 .840
CEDN + .788 .804 -
RDS .792 .810 .818
RCF .798 .815 -
RCF + .806 .824 .840
DeepBoundary .789 .811 .789
DeepBoundary + .809 .827 .861
Ours .805 .822 .834
Ours + .815 .834 .866
Performance on NYUD: Table 4 shows the quantitative results of our method compared with several recent methods, including gPb-UCM, gPb+NG, OEF, SE, SE+NG+, HED, RCF and LPCB, and the precision-recall (P-R) curves are shown in FIG. 7. The results in FIG. 7 are consistent with the experiments on the BSDS500 dataset. Our method achieves the best performance with an ODS score of 0.741, demonstrating its effectiveness.
Table 4 Comparison with other methods on the RGB portion of the NYUD dataset
Methods ODS OIS AP
gPb-UCM .631 .661 .562
gPb+NG .687 .716 .629
OEF .651 .667 -
SE .695 .708 .679
SE+NG+ .706 .734 .738
HED .717 .732 .734
RCF .729 .742 -
LPCB .739 .754 -
Ours .741 .759 .740

Claims (3)

1. A multi-scale edge detection method under deep supervision is characterized by comprising the following specific steps:
(1) Construct an edge detector comprising an improved VGG16 network and an attention module; the improved VGG16 network removes the fifth pooling layer and all fully connected layers of the original VGG16 and keeps the 13 convolution layers and the first four pooling layers; the attention module consists of a global module and a channel module, wherein the global module comprises a 1×1 convolution layer and a softmax layer, the channel module comprises a bottleneck structure, a normalization layer and a ReLU activation layer, the bottleneck structure comprises two fully connected layers, and each fully connected layer is a 1×1 convolution layer;
(2) Initialize the improved VGG16 network using a VGG16 pre-trained on ImageNet;
(3) Expand the dataset images using rotation, flipping and scaling; resize the images to 0.5, 1.0 and 1.5 times their original size to construct image pyramids, and sequentially input the image pyramids of each dataset into the edge detector;
(4) The improved VGG16 network performs the stage 1 to stage 4 convolution operations on the input dataset image; the attention module performs a 1×1 convolution on the output of stage 4, feeds the result into a softmax function to obtain a global context attention map, and shares the global context attention map with every channel of the stage-4 output features; one fully connected layer in the bottleneck structure reduces the channel dimension of the stage-4 output features fused with the global context attention, and LayerNorm normalizes the reduced global context attention; the normalized data of each channel are fed into a ReLU activation function, and the other fully connected layer in the bottleneck structure restores the reduced channel dimension, yielding features that fuse global information and recalibrate the inter-channel responses; these features are fed into the stage-5 convolution layers for the stage-5 convolution operations; the output of every convolution layer from stage 1 to stage 5 is then downsampled and multi-scale features are extracted to obtain a multi-scale feature map;
(5) The global module of the attention module performs a 1×1 convolution on the multi-scale features obtained in step (4), feeds the result into a softmax function to obtain a global context attention map, and shares the global context attention map with every channel of the multi-scale features;
(6) One fully connected layer in the bottleneck structure reduces the channel dimension of the multi-scale features fused with the global context attention, and LayerNorm normalizes the reduced global context attention; the normalized data of each channel are fed into a ReLU activation function, and the other fully connected layer in the bottleneck structure restores the reduced channel dimension, yielding features that fuse global information and recalibrate the inter-channel responses;
(7) The features obtained in step (6) are aggregated by addition onto each position of the multi-scale feature map of step (4) to obtain aggregated features;
(8) The aggregated features obtained in step (7) are element-wise added stage by stage using a convolution with kernel size 1×1 and channel depth 1 to obtain composite features;
(9) The composite features of step (8) are up-sampled by deconvolution to obtain the edge output of each stage, and the edge outputs are supervised using loss/sigmoid to optimize the edge detector parameters;
(10) The edge outputs of each stage in step (9) are fused using concatenation (concat) and a 1×1 convolution to obtain an edge prediction map;
(11) The edge prediction maps at the other sizes of the image pyramid are resized to the original image size by bilinear interpolation; the resized edge prediction maps are averaged to obtain the final prediction map; the edge detector parameters are continuously learned and optimized by supervising the edge prediction map with loss/sigmoid.
2. The method for multi-scale edge detection under deep supervision according to claim 1, wherein the loss function of loss/sigmoid is specifically as follows:
one sample of the input training dataset T is denoted by (X, Y), where X = {x_i, i = 1, ···, |X|} is the original input image and Y = {y_i, i = 1, ···, |X|}, y_i ∈ {0, 1}, is the corresponding ground-truth edge map; the training loss of each image is given by formula (1):
wherein Y+ and Y− denote the ground-truth label sets of edge and non-edge pixels, respectively, λ denotes the parameter that automatically balances the loss between the positive and negative classes, W denotes all network-layer parameters, P(y_i = 1 | X; W) denotes the probability, computed from the input X with parameters W, that pixel i is an edge when its true value y_i is 1, and P(y_i = 0 | X; W) denotes the probability, computed from the input X with parameters W, that pixel i is a non-edge when its true value y_i is 0;
the final loss is obtained by further aggregating the edge maps formed by the edge outputs of each stage in step (9), as shown in formula (2):
wherein X_j denotes the edge map output by stage j, and X_fuse denotes the edge map output by the final fusion layer.
3. The method for multi-scale edge detection under deep supervision according to claim 1 or 2, wherein the functions of the attention module are as follows: first, the 1×1 convolution W_G of the global module and a softmax function produce the global attention weights, a global context attention map S is computed by attention pooling, and the global context attention map S is shared so that the edge detector can obtain long-range global context information; then the channel response is recalibrated by the two 1×1 convolutions W_C in the bottleneck structure; finally, the global context features are weighted and aggregated by addition onto the features at each position;
let U = {u_n, n = 1, ···, N} denote the multi-scale feature map input to the attention module, where N = H×W is the number of pixels in the feature map; the global context attention is given by formula (3):
S = Σ_n [ exp(W_g u_n) / Σ_m exp(W_g u_m) ] · u_n    (3)
where n lists all possible positions, exp(W_g u_n) is the embedded Gaussian function used to compute similarity in the embedded space, Σ_m exp(W_g u_m) is the normalization factor, W_g denotes the 1×1 convolution W_G, and m is a variable that also lists all possible positions;
the bottleneck structure reduces the number of parameters from C·C to 2·C·C/r, where C is the number of channels, r is the bottleneck ratio, and C/r is the hidden representation dimension of the bottleneck; a normalization layer is added to the bottleneck transformation before the ReLU layer; let Z = {z_n, n = 1, ···, N} denote the output feature map of the attention module, and the complete attention module is given by formula (4):
z_n = u_n + W_C2 ReLU(LN(W_C1 S))    (4);
wherein W_C2 denotes the convolution operation of W_C2, and LN(W_C1 S) denotes applying the convolution W_C1 to the attention map S followed by layer normalization LN.
CN202011445466.4A 2020-12-11 2020-12-11 Multi-scale edge detection method under deep supervision Active CN112580661B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011445466.4A CN112580661B (en) 2020-12-11 2020-12-11 Multi-scale edge detection method under deep supervision

Publications (2)

Publication Number Publication Date
CN112580661A CN112580661A (en) 2021-03-30
CN112580661B true CN112580661B (en) 2024-03-08

Family

ID=75130942

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011445466.4A Active CN112580661B (en) 2020-12-11 2020-12-11 Multi-scale edge detection method under deep supervision

Country Status (1)

Country Link
CN (1) CN112580661B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113344005B (en) * 2021-05-12 2022-04-15 武汉大学 Image edge detection method based on optimized small-scale features
CN113469199A (en) * 2021-07-15 2021-10-01 中国人民解放军国防科技大学 Rapid and efficient image edge detection method based on deep learning
CN115019022B (en) * 2022-05-30 2024-04-30 电子科技大学 Contour detection method based on double-depth fusion network
CN116400490B (en) * 2023-06-08 2023-08-25 杭州华得森生物技术有限公司 Fluorescence microscopic imaging system and method thereof

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110009679A (en) * 2019-02-28 2019-07-12 江南大学 A kind of object localization method based on Analysis On Multi-scale Features convolutional neural networks
CN110706242A (en) * 2019-08-26 2020-01-17 浙江工业大学 Object-level edge detection method based on depth residual error network
CN110648316A (en) * 2019-09-07 2020-01-03 创新奇智(成都)科技有限公司 Steel coil end face edge detection algorithm based on deep learning
CN111462126A (en) * 2020-04-08 2020-07-28 武汉大学 Semantic image segmentation method and system based on edge enhancement

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Action recognition algorithm based on an edge-aware learning network under a complex surveillance background; 聂玮; 曹悦; 朱冬雪; 朱艺璇; 黄林毅; 计算机应用与软件 (Computer Applications and Software); 2020-08-12 (No. 08); full text *
Thangka image edge detection algorithm combining morphology and RCF; 刘千; 葛阿雷; 史伟; 计算机应用与软件 (Computer Applications and Software); 2019-06-12 (No. 06); full text *

Also Published As

Publication number Publication date
CN112580661A (en) 2021-03-30

Similar Documents

Publication Publication Date Title
CN112580661B (en) Multi-scale edge detection method under deep supervision
CN113313657B (en) Unsupervised learning method and system for low-illumination image enhancement
CN111209952B (en) Underwater target detection method based on improved SSD and migration learning
CN112507997B (en) Face super-resolution system based on multi-scale convolution and receptive field feature fusion
CN111950649B (en) Attention mechanism and capsule network-based low-illumination image classification method
CN113052210A (en) Fast low-illumination target detection method based on convolutional neural network
Sun et al. Robust retinal vessel segmentation from a data augmentation perspective
CN109034184B (en) Grading ring detection and identification method based on deep learning
CN112614077A (en) Unsupervised low-illumination image enhancement method based on generation countermeasure network
CN113780132B (en) Lane line detection method based on convolutional neural network
CN115063373A (en) Social network image tampering positioning method based on multi-scale feature intelligent perception
CN111582092B (en) Pedestrian abnormal behavior detection method based on human skeleton
CN112150493A (en) Semantic guidance-based screen area detection method in natural scene
CN110503140B (en) Deep migration learning and neighborhood noise reduction based classification method
CN113393457B (en) Anchor-frame-free target detection method combining residual error dense block and position attention
CN111667019B (en) Hyperspectral image classification method based on deformable separation convolution
CN110807742A (en) Low-light-level image enhancement method based on integrated network
CN113392711A (en) Smoke semantic segmentation method and system based on high-level semantics and noise suppression
CN114419413A (en) Method for constructing sensing field self-adaptive transformer substation insulator defect detection neural network
CN116071676A (en) Infrared small target detection method based on attention-directed pyramid fusion
CN112488958A (en) Image contrast enhancement method based on scale space
Guo et al. Multifeature extracting CNN with concatenation for image denoising
CN113361466B (en) Multispectral target detection method based on multi-mode cross guidance learning
CN112488220B (en) Small target detection method based on deep learning
CN117392375A (en) Target detection algorithm for tiny objects

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant