CN116563682A - Attention scheme and strip convolution semantic line detection method based on depth Hough network - Google Patents

Attention scheme and strip convolution semantic line detection method based on depth Hough network

Info

Publication number
CN116563682A
Authority
CN
China
Prior art keywords
convolution
network
output
module
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310532781.8A
Other languages
Chinese (zh)
Inventor
王亮 (Wang Liang)
章航 (Zhang Hang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN202310532781.8A
Publication of CN116563682A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/70 Labelling scene content, e.g. deriving syntactic or semantic representations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

A method for semantic line detection based on an attention mechanism and strip convolution in a deep Hough network, belonging to the field of image target recognition. The method comprises the following steps: resizing the images of the data set, inputting the preprocessed images into a convolutional model for feature extraction with an attention mechanism, and fusing multi-scale features to obtain the output feature result; performing Hough transform on the features of different scales and carrying out fused-feature regression prediction to obtain the recognition result. Strip convolution replaces the original spatial convolution layers to capture long-range semantic information of the recognition target in the input image; a mixed pooling layer pools the pictures of the forward network; a channel attention module added to the feature pyramid network helps model the importance of picture regions, reducing the influence of redundant information in the input image on the recognition result; and a lightweight Ghost convolution design reduces the computational load of the network. Compared with conventional semantic line detection methods, the method has fewer model parameters and higher recognition accuracy.

Description

Attention scheme and strip convolution semantic line detection method based on depth Hough network
Technical Field
The invention belongs to the technical field of target detection, and particularly relates to a semantic line detection method based on an attention scheme and strip convolution in a deep Hough network.
Background
Straight line detection is a classical problem in computer vision research. A semantic line is a special kind of straight line in the layout of a picture: given that a picture contains different backgrounds, semantic lines are the straight lines dividing the different semantic backgrounds. They explicitly represent the distribution of backgrounds in the image, and a photographer's composition can be optimized by adjusting the semantic line distribution of the image.
Semantic line detection has practical significance, such as estimating the levelness of a picture and optimizing photographic composition. In addition, it can serve as a preliminary operation for downstream vision tasks such as lane detection, horizon detection and sea level detection. Because semantic lines are boundaries between different scenes in a picture, they can be used as boundaries for preliminary semantic background segmentation; using semantic lines to simplify the semantic background of an image provides a prior for many vision applications.
Although semantic lines are widely useful, semantic line detection remains a difficult problem due to the diversity of line segments and semantic backgrounds in pictures. On the one hand, mainstream semantic line detection methods target special cases of semantic lines, for example lane lines or sea level lines; few methods consider the detection of global semantic lines. On the other hand, because semantic background information follows no fixed rules, traditional convolutional neural networks have difficulty processing low-level detail information together with high-level semantic information. Early deep-learning-based studies therefore targeted specific special cases such as lane lines and sea level lines: the target line areas are segmented by edge detection filtering, and the target lines in the picture are then extracted and detected by conventional Hough transform or RANSAC algorithms. However, such methods are affected by factors such as the brightness and occlusion of the input image, which reduces their robustness, so they are not suitable for detecting and identifying semantic lines. In 2020, Zhao et al. proposed DHT, a semantic line detection network combining a feature pyramid network with a deep Hough transform structure derived from the conventional Hough transform (Zhao, K.; Han, Q.; Zhang, C.-B.; Xu, J.; Cheng, M.-M. Deep Hough Transform for Semantic Line Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, Sept. 2022, Vol. 44, No. 9, pp. 4793-4806). Its basic idea is to combine, via multi-scale Hough transforms over the feature pyramid, the high-level semantic information and low-level detail information of the input image along candidate lines, and finally to output semantic line predictions through regression along the channel dimension. However, this network ignores the long-range semantic background information of pixels in the picture, so the output semantic lines cannot accurately separate the background information in the picture, and prediction accuracy still needs improvement.
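As background for the deep Hough idea referenced above, the following is a minimal numeric sketch (not the patented network): every activation of a C×H×W feature map votes for all discretized lines (θ, ρ) passing through its pixel, producing a parameter-space accumulator. Function and parameter names here are illustrative assumptions.

```python
import math
import torch

def deep_hough(feat: torch.Tensor, num_theta: int = 180, num_rho: int = 100) -> torch.Tensor:
    """feat: (C, H, W) feature map -> (C, num_theta, num_rho) parameter-space accumulator."""
    C, H, W = feat.shape
    device = feat.device
    # pixel coordinates measured from the image centre
    ys = torch.arange(H, device=device).float().view(H, 1) - H / 2
    xs = torch.arange(W, device=device).float().view(1, W) - W / 2
    rho_max = math.hypot(H / 2, W / 2)
    acc = torch.zeros(C, num_theta, num_rho, device=device)
    flat = feat.reshape(C, -1)
    for t in range(num_theta):
        theta = math.pi * t / num_theta
        # signed distance of every pixel to the line family at angle theta
        rho = xs * math.cos(theta) + ys * math.sin(theta)            # (H, W)
        idx = ((rho + rho_max) / (2 * rho_max) * (num_rho - 1)).long()
        idx = idx.clamp(0, num_rho - 1).reshape(1, -1).expand(C, -1)
        # every pixel votes its feature value into its (theta, rho) bin
        acc[:, t].scatter_add_(1, idx, flat)
    return acc
```

A trainable version of this operation is what allows parameter-space supervision to be back-propagated into the image-space feature extractor.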
Disclosure of Invention
To address the technical shortcomings of existing semantic line recognition, which ignores global context information and suffers heavy interference from irrelevant redundant information, the invention designs a semantic line detection method based on strip convolution and an attention mechanism in a deep Hough network, helping the network better exploit multi-scale maps of the input image while reducing network computation. Specifically, the improved semantic line detection network first improves recognition accuracy through a feature pyramid network augmented with strip convolution layers, strip mixed pooling layers and a feature attention selection module, and additionally designs a lightweight structure combined with Ghost convolution. Finally, the feature extraction network outputs multi-scale feature maps; after Hough transformation of the multi-scale feature maps, the parameter-space maps of different scales are aggregated along the channel dimension to predict the final parameter-space points, and the points in the parameter space are then converted into semantic lines in image space through the inverse Hough transform.
The invention adopts the following technical scheme and implementation steps:
An improved semantic line detection method based on a deep Hough network, the method comprising:
step 1: preprocessing image data;
step 2: constructing a strip convolution neural network for detecting semantic lines;
step 3: training the constructed network;
step 4: and carrying out semantic line detection test by using the trained strip convolution neural network.
The data preprocessing specifically comprises the following steps:
the data preprocessing refers to preprocessing photo data or an existing public data set acquired by equipment such as a camera, and the specific process is as follows: modifying the image data size in the acquired photograph or dataset to h×w, wherein H, W represents the height and width of the input image size, respectively; in addition, in order to improve the generalization capability of the network, the invention adopts a data overturning mode to realize data amplification.
The deep Hough attention scheme and strip convolution network for semantic line detection specifically comprises the following modules:
module 1: banded convolution module
The strip convolution module is used to obtain a multi-scale output matrix favorable for the network's subsequent Hough transform by capturing long-range semantic background information between pixels. Its input is the preprocessed picture information in the network, its output is intermediate multi-channel image information obtained by the convolution processing, and the parameters involved in the strip convolution module are obtained by training. The module downsamples the input picture information three times using the strip convolution operation, while the channel dimension of the picture information remains unchanged. The specific calculation steps of the strip convolution module are as follows:
z(i, j) = f_s(z_1(i, j) + z_2(i, j)) (1)

where z_1(i, j) and z_2(i, j) are the outputs of the vertical and horizontal strip convolutions at position (i, j), whose corresponding weights are learned during training; x(i, j) is the input at position (i, j); f_s(·) is a 1×1 convolution layer; and z(i, j) is the output combining the vertical and horizontal strip convolutions. Finally, following the residual network structure, this output is fused with the initial input x to produce the final result.
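For illustration, a minimal PyTorch sketch of the strip convolution block of formula (1) follows, assuming the 1×5 and 5×1 strips used later in the embodiment; class and argument names are illustrative, not taken from the patent.

```python
import torch
import torch.nn as nn

class StripConvBlock(nn.Module):
    def __init__(self, channels: int, k: int = 5):
        super().__init__()
        p = k // 2
        # horizontal strip: 1 x k kernel; vertical strip: k x 1 kernel
        self.conv_h = nn.Conv2d(channels, channels, (1, k), padding=(0, p))
        self.conv_v = nn.Conv2d(channels, channels, (k, 1), padding=(p, 0))
        self.fuse = nn.Conv2d(channels, channels, 1)    # f_s: a 1x1 convolution layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.fuse(self.conv_v(x) + self.conv_h(x))  # formula (1)
        return x + z                                    # residual fusion with the input x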
Module 2: strip and space mixed pooling layer module
The strip and spatial mixed pooling layer module is used to compress pixel features in the picture information and simplify network parameters. Its input is two-dimensional picture matrix information after convolution transformation, and its output is a two-dimensional matrix whose values aggregate the features of neighboring pixels. The specific operation steps are as follows: pixels obtained by fusing the short-range information of the spatial pooling layer with the long-range information of the strip pooling layer contain both global context information and channel context information. The specific calculation steps are as follows:
y_s(i, j) = (1/(h·w)) · Σ_{0≤m<h} Σ_{0≤n<w} x(i·h + m, j·w + n) (2)

y_h(i) = (1/W) · Σ_{0≤j<W} x(i, j) (3)

y_v(j) = (1/H) · Σ_{0≤i<H} x(i, j) (4)

y_e(i, j) = y_h(i) + y_v(j) (5)

y_C(i, j) = Σ_k w_k · y_k(i, j) (6)

where x is the two-dimensional input picture information tensor of the strip pooling layer; H×W is the spatial height and width; h, w is the spatial range to be pooled; y_h is the output of horizontal strip pooling; y_v is the output after vertical strip pooling; y_e is the expansion combining the two strip pooling outputs; y_k ranges over the different pooling results (y_s and y_e); w_k is the weight of each pooling result; and y_C(i, j) is the final pooled output result.
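A minimal PyTorch sketch of the mixed pooling idea follows; it simplifies the weighted fusion of formula (6) to a learnable 1×1 convolution over the three concatenated pooling paths, which is an assumption rather than the exact embodiment, and all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedStripPool(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Conv2d(3 * channels, channels, 1)  # stand-in for the weights w_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        y_h = x.mean(dim=3, keepdim=True).expand(-1, -1, -1, W)  # horizontal strip average, formula (3)
        y_v = x.mean(dim=2, keepdim=True).expand(-1, -1, H, -1)  # vertical strip average, formula (4)
        y_s = F.avg_pool2d(x, 3, stride=1, padding=1)            # local spatial average, formula (2)
        return self.fuse(torch.cat([y_h, y_v, y_s], dim=1))      # fusion, formulas (5)-(6)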
Module 3: feature selection module
The feature selection module is a channel attention network module for emphasizing important spatial details before channel dimension reduction is performed on input features between feature pyramid networks. Its input is picture pixel information from the upper-layer network, and its output is picture pixel output giving corresponding weights to the channels in the picture. The specific operation steps are as follows: using the channel attention mechanism of the SENet network, D channel weights u = [u_1, u_2, …, u_D] are output through a 1×1 convolution operation; the channel dimension weights are multiplied by the input to weight the original input along the channel dimension; finally, the channel-weighted input is added to the original input and a 1×1 convolution operation is performed to represent the final picture output, defined as follows:
u = f_m(z) (7)

C_out = f_s(u · C_i + C_i) (8)

where C_out is the final output result; C_i is the original input of the module; f_m(·) consists of a 1×1 convolution layer and a sigmoid activation layer responsible for giving the channels their corresponding weights; z is obtained from C_i by global pooling; and f_s(·), consisting of a 1×1 convolution layer, is responsible for integrating the channel dimension.
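A minimal PyTorch sketch of the feature selection module of formulas (7)-(8) follows; class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class FeatureSelection(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.f_m = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.f_s = nn.Conv2d(channels, channels, 1)

    def forward(self, c: torch.Tensor) -> torch.Tensor:
        z = c.mean(dim=(2, 3), keepdim=True)  # global average pooling -> (B, C, 1, 1)
        u = self.f_m(z)                       # formula (7): channel weights u in (0, 1)
        return self.f_s(u * c + c)            # formula (8): reweight, add the input, integrate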
Module 4: ghost convolution light module
The Ghost convolution light module consists of classical convolution and low-cost convolution operation, is used for replacing an original convolution layer to reduce the calculation amount of the whole network, has the same input as the original classical convolution input and is picture pixel information, and is output as the characteristic of the whole picture information. The process is as follows: firstly, dividing input picture pixel information into two parts along a channel dimension, carrying out feature extraction on a first part through classical convolution to obtain an inherent feature map, then carrying out feature extraction on the inherent feature map which is input as the first part by a second part through low-cost convolution operation, and finally integrating the feature maps of the two parts along the channel dimension to output final feature extraction picture information.
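A minimal PyTorch sketch of such a Ghost convolution with the 1:1 ratio used later in the embodiment is given below; a depthwise convolution stands in for the low-cost operation, and all names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel: int = 3):
        super().__init__()
        intrinsic = out_ch // 2                     # 1:1 ratio of classic to cheap features
        self.primary = nn.Conv2d(in_ch, intrinsic, kernel, padding=kernel // 2)
        # cheap operation: depthwise convolution over the intrinsic feature maps
        self.cheap = nn.Conv2d(intrinsic, intrinsic, kernel,
                               padding=kernel // 2, groups=intrinsic)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y1 = self.primary(x)                # classic (intrinsic) features
        y2 = self.cheap(y1)                 # low-cost ghost features derived from y1
        return torch.cat([y1, y2], dim=1)   # integrate the two parts along the channel dimension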
Module 5: output module
The output module mainly comprises a concatenation function and a 1×1 convolution layer. It is used for integrating the parameter-space maps of Hough transforms of different scales, performing regression prediction, and applying threshold binarization to the integrated parameter-space map.
Training the network model specifically comprises the following steps:
Step 1: The preprocessed training data set is input into the improved deep-Hough-based semantic line detection network model for back propagation training; the learning parameters of the model comprise weights and bias terms, and cosine annealing is adopted to adjust the learning rate when training the network model.
Step 2: introducing binary cross entropy loss function in parameter space
Wherein: y' (x,y) Refers to the value of (x, y) in the matrix coordinates in the truth diagram in the parameter space, y (x,y) And (3) referring to a value of (x, y) matrix coordinates in the prediction labels in the parameter space, carrying out back propagation by using a batch gradient descent method according to a loss function, and updating learning parameters of the model, wherein the learning parameters comprise weights and bias items.
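For illustration, the loss of formula (9) is the standard binary cross-entropy, sketched below in PyTorch; tensor names are assumptions.

```python
import torch
import torch.nn.functional as F

def parameter_space_loss(pred_prob: torch.Tensor, truth: torch.Tensor) -> torch.Tensor:
    """Explicit form of formula (9); pred_prob lies in (0, 1), truth is a 0/1 map of the same shape."""
    eps = 1e-7                                  # numerical guard against log(0)
    p = pred_prob.clamp(eps, 1 - eps)
    return -(truth * torch.log(p) + (1 - truth) * torch.log(1 - p)).sum()

# equivalently (up to reduction), PyTorch's built-in:
# F.binary_cross_entropy(pred_prob, truth, reduction="sum")
```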
Step 3: and (3) repeating the step (1) and the step (2), and continuously iterating and training network model parameters to obtain an optimal detection semantic line network model.
And carrying out semantic line detection test by using the trained model.
The beneficial effects are that:
The invention provides a semantic line detection method based on a deep Hough network which, through the designed strip convolution layer, strip and spatial mixed pooling layer module, inter-network feature selection module and Ghost convolution lightweight module, realizes the semantic line detection task end to end. The semantic line detection network not only incorporates local spatial features during feature extraction but also integrates global context and deep context information, which improves the detection precision of the network, reduces its computation, and improves its response time. On the given 1700 semantic line test images, the invention achieves an accuracy of 69.7%, reaching the SOTA level on the current data set.
Drawings
Fig. 1 is a flowchart of the attention scheme and strip convolution semantic line detection method based on the deep Hough network provided by the invention;
Fig. 2 is a network structure diagram of the attention scheme and strip convolution semantic line detection method based on the deep Hough network provided by the invention;
Fig. 3 shows recognition results of the attention scheme and strip convolution semantic line detection method based on the deep Hough network: the first column is the original image, and the second column is the recognition result image.
Detailed Description
The invention aims to provide a semantic line detection method based on a deep Hough transform network that can complete end-to-end training of the network without any post-processing.
The invention will now be described in detail with reference to the accompanying drawings, it being pointed out that the embodiments described are only intended to facilitate an understanding of the invention and do not in any way limit it.
FIG. 1 is a network flow diagram of semantic line detection provided by the present invention; FIG. 2 is a network structure diagram of semantic line detection provided by the present invention; fig. 3 is a result diagram of semantic line detection provided by the present invention.
The semantic line detection method based on the depth Hough network provided by the invention specifically comprises the following steps:
step 1: data preprocessing
The data preprocessing includes a preprocessing portion of the data set: the public data set is split into training and test sets, and all pictures in the data set are rescaled to a resolution of 120×160.
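A minimal preprocessing sketch under these settings follows; the torchvision transforms shown are an assumed implementation, and in practice the semantic line annotations must be flipped together with the image (only the image side is shown here).

```python
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((120, 160)),           # (height, width) required by the network
    transforms.RandomHorizontalFlip(p=0.5),  # data augmentation by flipping
    transforms.ToTensor(),
])
```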
Step 2: Constructing the semantic line detection network
The improved deep Hough network for semantic line detection disclosed by the invention specifically comprises a strip convolution module, a strip and spatial mixed pooling layer module, a feature selection module, a GhostNet module and an output module.
The strip convolution module is a 1×n and n×1 strip convolution transform designed to have a computation amount similar to the spatial convolution in the original network. Its input is the preprocessed input picture, its output is the intermediate picture information obtained by the convolution transform, and the parameters involved in the strip convolution module are obtained by training. The module is formed by convolution kernels of 1×1, 1×5 and 5×1 with a stride of 1; the output channel count equals the input channel count, padding = 2, and the module finally outputs picture information of size C×H×W, where C is the output image channel count. The specific calculation of the strip convolution is shown in formula (1).
The strip and spatial mixed pooling layer module is a pooling layer module for downsampling and fusing the global context information of the pixels in a picture. Its input is the intermediate picture information after convolution transformation, and its output is feature information that integrates each pixel's surrounding information with its global context features. The specific operation steps are as follows: the input is divided into three paths, namely a spatial pooling layer, a horizontal strip pooling layer and a vertical strip pooling layer. The spatial pooling path uses an average pooling layer. The horizontal and vertical strip pooling layers each take the global pixel average along their respective direction to obtain 1×n or n×1 intermediate information, upsample it to size n×n via a 1-dimensional convolution, and produce the final mixed pooling result of feature output size C×H×W through a 2-dimensional convolution with a sigmoid combined with the intermediate output of the spatial pooling layer, as shown in fig. 2.
The feature selection module consists of a Squeeze layer and an Excitation layer and is used to improve the feature expressiveness of the network and reduce the influence of redundant information. Its input is the upper-layer picture features extracted by the convolutional features of the forward network, and its output is the intermediate global feature output with channel weights. The feature selection module improves the network's feature expressiveness as follows: the channel importance vector u is first obtained by global pooling followed by a 1×1 convolution and a sigmoid activation; it is then multiplied by the input to obtain the intermediate output u × C_i, which is added to the input C_i; finally, a 1×1 convolution integrates the result into upper-layer semantic picture features of output size C×H×W.
The Ghost lightweight convolution module is composed of two convolution parts: a basic convolution layer and a low-cost convolution operation. Its input is the preprocessed input picture, and its output is the intermediate picture information obtained by the two convolution parts and concatenated along the channel dimension. The GhostNet module extracts picture features as follows: first, the basic convolution layer generates classic features at a set ratio (a 1:1 ratio of classic convolution to low-cost convolution is chosen); then the classic features generated by the first part are passed into the low-cost convolution part to generate low-cost features; finally, the features generated by the two parts are concatenated along the channel dimension to obtain features of size C×H×W. The convolution kernel sizes are [1, 3, 1] with a stride of 1.
The output module combines the multi-scale parameter-space features obtained in the deep Hough stage using an aggregation function; a 1×1 convolution layer with input channel count 512 and output channel count 1 outputs a three-dimensional matrix of size 1×120×160, and the final regression prediction result in the parameter space is obtained by thresholding.
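A minimal PyTorch sketch of this output module follows; the channel counts follow the description above, while the threshold value of 0.5 and all names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OutputHead(nn.Module):
    def __init__(self, in_ch: int = 512):
        super().__init__()
        self.head = nn.Conv2d(in_ch, 1, kernel_size=1)   # 512 channels -> 1 channel

    def forward(self, param_maps, out_size=(120, 160), threshold=0.5):
        # bring every scale's parameter-space map to a common size, then fuse
        up = [F.interpolate(m, size=out_size, mode="bilinear", align_corners=False)
              for m in param_maps]
        prob = torch.sigmoid(self.head(torch.cat(up, dim=1)))  # regression prediction
        return (prob > threshold).float()                      # threshold binarization
```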
The processing flow of the semantic line detection network specifically comprises the following steps:
In order to fuse the multi-scale semantic features and detail features of the input picture, the preprocessed input picture is passed through a forward network of strip convolution modules, strip and spatial mixed pooling layers and GhostNet convolution layers for feature extraction, where each pixel in the output multi-scale picture information is composed of n×n spatial information together with b×1 and 1×b long-range background information, with n = 3 and b = 5.
Then, the feature selection module between the reverse network and the forward network fuses the upper-layer semantic features extracted by the forward network with the lower-layer detail features of the reverse network. Compared with low-dimensional features, high-dimensional features suffer from channel information redundancy, and the feature selection module avoids the erroneous influence of this redundant information. Finally, the multi-scale features obtained in the feature extraction stage are Hough-transformed into the parameter space, the parameter-space features of different scales are brought to a uniform scale by bilinear interpolation, and fused regression prediction along the channel dimension outputs the semantic line prediction probability in the parameter space.
Step 3: training a network model:
The hardware environment of the invention is an Intel Xeon Platinum 8255C processor, 11 GB of memory, and a 2080 Ti GPU; the operating environment is the Ubuntu 18.04 operating system with CUDA 11.0, PyTorch 1.7.0 and Python 3.8.
Firstly, photo information from the preprocessed training data set, with a batch size of 8, is taken as input, and forward propagation training is performed on the constructed semantic line detection network model to learn the network parameters; the learning rate is 0.02, the batch normalization momentum is 0.9, and the maximum number of iterations is 30. An Adan (Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models) optimizer is adopted for back propagation training, updating the model's learning parameters such as weights and bias terms; during model training and optimization, Adan gives each parameter an adaptive learning rate so as to improve both optimization quality and speed. The output semantic line parameter-space predictions and the true semantic line parameter-space values are input into the binary cross-entropy loss function (9), and back propagation is carried out using a batch gradient descent method. Finally, the learning rate is updated with cosine annealing and the model parameters are updated iteratively over 30 iterations to obtain the optimal semantic line detection network model.
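A minimal training-loop sketch under the stated hyper-parameters follows. Since Adan is a third-party optimizer, SGD with Nesterov momentum is substituted here as a stand-in, and `model` and `train_loader` are assumed to exist.

```python
import torch
import torch.nn.functional as F

# assumed: `model` is the constructed network, `train_loader` yields batches of size 8
optimizer = torch.optim.SGD(model.parameters(), lr=0.02, momentum=0.9, nesterov=True)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=30)

for epoch in range(30):                          # maximum number of iterations: 30
    for images, truth_maps in train_loader:
        pred = model(images)                      # parameter-space logits
        loss = F.binary_cross_entropy_with_logits(pred, truth_maps)  # formula (9)
        optimizer.zero_grad()
        loss.backward()                           # back propagation
        optimizer.step()
    scheduler.step()                              # cosine-annealed learning rate per epoch
```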
Step 4: and performing detection test by using the trained detection semantic line network.
A semantic line detection test is carried out using the trained semantic line detection network. The test phase differs from the training phase in that only the trained network model needs to be loaded; the network does not need to be trained again.
The foregoing is merely illustrative of embodiments of the present invention, and the scope of the present invention is not limited thereto; modifications and substitutions readily conceivable by a person skilled in the art fall within the scope of the present invention, which is defined by the appended claims.

Claims (6)

1. A semantic line detection method based on an attention scheme and strip convolution in a deep Hough network, characterized by comprising the following steps:
step 1: preprocessing data of a data set, inputting the preprocessed data into a model based on a attention scheme and a strip convolution semantic line detection of a depth Hough network, marking semantic line true value lines of the data set data by adopting an online public data set and picture information acquired by a camera by adopting a human expert;
step 2: the method comprises the steps of constructing a semantic line detection network, wherein the semantic line detection network comprises a strip convolution module, a strip and space mixing pooling module, a feature selection module, a GhostNet convolution module and an output module;
step 3: model training: inputting the preprocessed training data set photo information into a semantic line detection network model for forward propagation calculation to obtain a final prediction result; then, converting the output prediction result and the truth diagram into a parameter space, inputting the parameter space into a loss function, and carrying out directional propagation by using a batch gradient descent method; adopting an Adan optimizer to update each parameter of a model, wherein the learning parameters of the model comprise weights and bias items;
step 4: and carrying out semantic line detection by using the trained semantic line detection network model.
2. The method according to claim 1, wherein the preprocessing of step 1 means: the input picture information of the network needs to be scaled to the size required by the network during preprocessing.
3. The method according to claim 1, wherein the specific calculation steps of the strip convolution layer are as follows:

z(i, j) = f_s(z_1(i, j) + z_2(i, j)) (1)

where z_1(i, j) and z_2(i, j) are the outputs of the vertical and horizontal strip convolutions at position (i, j), whose corresponding weights are learned during training; x(i, j) is the input at position (i, j); f_s(·) is a 1×1 convolution layer; and z(i, j) is the output combining the vertical and horizontal strip convolutions; finally, following the residual network structure, this output is fused with the initial input x to produce the final result.
4. The method of claim 1, wherein the strip and spatial mixed pooling layer module means: pixels obtained by fusing the short-range information of the spatial pooling layer with the long-range information of the strip pooling layer contain both global context information and channel context information; the specific calculation steps are as follows:

y_s(i, j) = (1/(h·w)) · Σ_{0≤m<h} Σ_{0≤n<w} x(i·h + m, j·w + n) (2)

y_h(i) = (1/W) · Σ_{0≤j<W} x(i, j) (3)

y_v(j) = (1/H) · Σ_{0≤i<H} x(i, j) (4)

y_e(i, j) = y_h(i) + y_v(j) (5)

y_C(i, j) = Σ_k w_k · y_k(i, j) (6)

where x is the two-dimensional input picture information tensor of the strip pooling layer; H×W is the spatial height and width; h, w is the spatial range to be pooled; y_h is the output of horizontal strip pooling; y_v is the output after vertical strip pooling; y_e is the expansion combining the two strip pooling outputs; y_k ranges over the different pooling results (y_s and y_e); w_k is the weight of each pooling result; and y_C(i, j) is the final pooled output result.
5. The method of claim 1, wherein the feature selection module of step 1 refers to: a channel attention network module used to emphasize important spatial details before channel dimension reduction is performed on input features between the feature pyramid networks, whose input is picture pixel information from the upper-layer network and whose output is picture pixel output giving corresponding weights to the channels in the picture; the specific operation steps are as follows: using the channel attention mechanism of the SENet network, D channel weights u = [u_1, u_2, …, u_D] are output through a 1×1 convolution operation; the channel dimension weights are multiplied by the input to weight the original input along the channel dimension; finally, the channel-weighted input is added to the original input and a 1×1 convolution operation is performed to represent the final picture output, defined as follows:

u = f_m(z) (7)

C_out = f_s(u · C_i + C_i) (8)

where C_out is the final output result; C_i is the original input of the module; f_m(·) consists of a 1×1 convolution layer and a sigmoid activation layer responsible for giving the channels their corresponding weights; z is obtained from C_i by global pooling; and f_s(·), consisting of a 1×1 convolution layer, is responsible for integrating the channel dimension.
6. The method of claim 1, wherein the Ghost convolution lightweight module means: in the strip convolution module, the 1×1, 1×5, 5×1 and 1×1 convolutions are first decomposed; both the 1×1 convolution before the depthwise convolution and the 1×1 convolution after it are replaced with Ghost convolutions to reduce parameters, and finally the two convolutions are spliced.
CN202310532781.8A (priority date 2023-05-11, filing date 2023-05-11): Attention scheme and strip convolution semantic line detection method based on depth Hough network. Pending, published as CN116563682A.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310532781.8A CN116563682A (en) 2023-05-11 2023-05-11 Attention scheme and strip convolution semantic line detection method based on depth Hough network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310532781.8A CN116563682A (en) 2023-05-11 2023-05-11 Attention scheme and strip convolution semantic line detection method based on depth Hough network

Publications (1)

Publication Number Publication Date
CN116563682A true CN116563682A (en) 2023-08-08

Family

ID=87491242

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310532781.8A Pending CN116563682A (en) 2023-05-11 2023-05-11 Attention scheme and strip convolution semantic line detection method based on depth Hough network

Country Status (1)

Country Link
CN (1) CN116563682A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116740385A (en) * 2023-08-08 2023-09-12 深圳探谱特科技有限公司 Equipment quality inspection method, device and system
CN117292266A (en) * 2023-11-24 2023-12-26 河海大学 Method and device for detecting concrete cracks of main canal of irrigation area and storage medium
CN117292266B (en) * 2023-11-24 2024-03-22 河海大学 Method and device for detecting concrete cracks of main canal of irrigation area and storage medium

Similar Documents

Publication Publication Date Title
CN110335290B (en) Twin candidate region generation network target tracking method based on attention mechanism
CN110210551B (en) Visual target tracking method based on adaptive subject sensitivity
CN111652321B (en) Marine ship detection method based on improved YOLOV3 algorithm
CN112001960B (en) Monocular image depth estimation method based on multi-scale residual error pyramid attention network model
CN112396607B (en) Deformable convolution fusion enhanced street view image semantic segmentation method
CN111612807B (en) Small target image segmentation method based on scale and edge information
CN112488210A (en) Three-dimensional point cloud automatic classification method based on graph convolution neural network
CN110599401A (en) Remote sensing image super-resolution reconstruction method, processing device and readable storage medium
CN110533712A (en) A kind of binocular solid matching process based on convolutional neural networks
CN113033570B (en) Image semantic segmentation method for improving void convolution and multilevel characteristic information fusion
CN111931787A (en) RGBD significance detection method based on feature polymerization
CN116563682A (en) Attention scheme and strip convolution semantic line detection method based on depth Hough network
CN112232134B (en) Human body posture estimation method based on hourglass network and attention mechanism
CN111695494A (en) Three-dimensional point cloud data classification method based on multi-view convolution pooling
CN112489164B (en) Image coloring method based on improved depth separable convolutional neural network
CN111242999B (en) Parallax estimation optimization method based on up-sampling and accurate re-matching
CN115082293A (en) Image registration method based on Swin transducer and CNN double-branch coupling
CN114511798B (en) Driver distraction detection method and device based on transformer
CN115359372A (en) Unmanned aerial vehicle video moving object detection method based on optical flow network
CN115222998B (en) Image classification method
CN113393457A (en) Anchor-frame-free target detection method combining residual dense block and position attention
CN116486074A (en) Medical image segmentation method based on local and global context information coding
CN116402851A (en) Infrared dim target tracking method under complex background
CN117576402A (en) Deep learning-based multi-scale aggregation transducer remote sensing image semantic segmentation method
CN113989612A (en) Remote sensing image target detection method based on attention and generation countermeasure network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination