CN114677558A - Target detection method based on direction gradient histogram and improved capsule network - Google Patents

Target detection method based on direction gradient histogram and improved capsule network Download PDF

Info

Publication number
CN114677558A
CN114677558A (application CN202210234245.5A)
Authority
CN
China
Prior art keywords
image
convolution
feature
network
parallel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210234245.5A
Other languages
Chinese (zh)
Inventor
史振宁
章妍妍
倪宇昊
余晨晨
林晨曦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU filed Critical Hohai University HHU
Priority to CN202210234245.5A priority Critical patent/CN114677558A/en
Publication of CN114677558A publication Critical patent/CN114677558A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target detection method based on a direction gradient histogram and an improved capsule network. The improved capsule network extracts comprehensive features with a parallel convolution network, forms feature vectors through a redundancy-removing capsule network, reconstructs the image with a deconvolution image reconstruction network, and trains the network model. Finally, a 3 × 3 × 256 convolution layer and two parallel 1 × 1 convolution kernels extract a feature map of the detection frame center point and a feature map of the detection frame scale, from which the corresponding target frames are formed and the detection result is output.

Description

Target detection method based on direction gradient histogram and improved capsule network
Technical Field
The invention relates to the technical field of target detection, in particular to a target detection method based on a direction gradient histogram and an improved capsule network.
Background
A traditional convolutional neural network extracts target features through convolution operations and learns by back propagation in order to perform target detection, and it has achieved remarkable results on target detection tasks. However, the traditional convolutional neural network pays insufficient attention to information such as the relative positions and directions of elements in an image, and its pooling layers lose information, so factors such as occlusion by obstacles and severe weather can seriously affect the detection and recognition of a target. The capsule neural network introduces the concept of the capsule to replace some of the neurons of a convolutional neural network, expanding the scalars of the traditional neural network into vectors, and thereby effectively overcomes these shortcomings of the traditional neural network. When applied to target detection, the capsule neural network therefore achieves a higher overall recognition rate, with the accuracy and robustness needed for target detection under different influencing factors.
However, a conventional capsule network allows only one capsule of a given type at a given location, so two objects of the same type cannot be detected if they are too close to each other. The paper "Vehicle type classification with a capsule network using gradient-histogram convolution features under traffic monitoring" preprocesses the original image with a histogram of oriented gradients, which alleviates this problem to some extent; but because the structure of the capsule network itself is not improved, problems such as insufficient extraction of original-image information, a redundant capsule structure and high algorithm complexity remain.
Disclosure of Invention
The invention aims to solve the technical problems in the prior art that extraction of original-image information is not rich enough, the capsule structure is redundant, and the algorithm complexity is high, and provides a target detection method based on a direction gradient histogram and an improved capsule network.
A target detection method based on a direction gradient histogram and an improved capsule network comprises the following steps:
1) obtaining target original images, marking the target positions with a labeling tool, and then randomly selecting different images as a training set;
2) fusing the direction gradient histogram (histogram of oriented gradients, HOG) of the original image with a convolution feature map in parallel, so as to combine the edge contour features of the image with the visual-field features of the convolution kernels, and using the result as the input of the improved capsule network;
3) the improved capsule network extracts comprehensive features with a parallel convolution network, forms feature vectors through a redundancy-removing capsule network, and reconstructs the image with a deconvolution image reconstruction network;
4) extracting a feature map of the detection frame center point and a feature map of the detection frame scale with a 3 × 3 × 256 convolution layer and two parallel 1 × 1 convolution kernels, forming the corresponding target frames, and outputting the detection result.
Preferably, the parallel fusion algorithm of the Histogram of Oriented Gradients (HOG) and the convolution feature map in the step 2) specifically comprises the following steps:
2-1) normalization; firstly, dividing a target image into 4 cells, dividing each cell into 9 blocks, carrying out Gamma normalization processing, and carrying out parameter optimization on a Gamma correction value; the normalization formula is as follows:
τ_ij′ = τ_ij / √(‖τ_ij‖² + ε²)
f = [τ_11′, τ_12′, …, τ_ij′, …]
where τ denotes the feature vector of a block, ε is a small constant, τ_ij denotes the i-th block in the j-th cell of the target image, and f denotes the target image after all block feature vectors have been normalized.
2-2) selecting a detection window from the normalized image; the detection window has the same aspect ratio as the image and does not exceed one half of the image size.
2-3) selecting blocks from the window; rectangular blocks of equal length and width are selected according to the detection window.
2-4) dividing cells within a block; within each rectangular block, square cells of 8 × 8 pixels are used as the minimum units of feature extraction within the block.
2-5) performing directional projection within the cell. The cell is divided into 9 directions, and direction information is extracted for each 20-degree angle range. The direction information is obtained by convolving the gray-level image I with the gradient template U in the horizontal x direction and the vertical y direction; the mathematical formulas are as follows:
G_x(x, y) = H(x+1, y) - H(x-1, y)
G_y(x, y) = H(x, y+1) - H(x, y-1)
G(x, y) = √(G_x(x, y)² + G_y(x, y)²)
α(x, y) = arctan(G_y(x, y) / G_x(x, y))
where H(x, y) denotes the gray value at the corresponding coordinate, G_x the horizontal gradient value, G_y the vertical gradient value, G the gradient magnitude, and α the gradient direction.
2-6) normalization within the cell. The number of gradient directions falling into each direction-angle range is counted within the cell to obtain a direction histogram, and the direction angle in which the directions are most concentrated is selected as the direction of the cell.
2-7) building HOG features within the block. The number of directions in each cell's direction-angle range is counted within the block to obtain a direction histogram, and the direction angle in which the directions are most concentrated is selected as the direction of the block.
2-8) if the last block is not reached, returning to step 2-3).
2-9) if the last window is not reached, returning to the step 2-2), otherwise, obtaining a direction gradient histogram.
2-10) single-line convolution. The original image is input and features are extracted with a single-line convolution layer to obtain a convolution feature map.
2-11) feature splicing and parallel fusion. The convolution feature map is dimension-connected with the direction gradient histogram, and the two feature maps are connected in the third dimension to obtain a 28 × 28 × 1 image.
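As a concrete illustration of steps 2-4) to 2-6) and 2-11), the following sketch computes one 9-bin direction histogram per 8 × 8 cell and then concatenates the HOG features with a convolution feature map along the third dimension. It is a minimal NumPy sketch, not the patented implementation; the function names (hog_cell_histograms, fuse_hog_and_conv) and the per-cell output layout are assumptions made for illustration.

```python
import numpy as np

def hog_cell_histograms(gray, cell=8, bins=9):
    """One 9-bin direction histogram per 8x8 cell (steps 2-4 to 2-6); gray is a 2-D float array."""
    # Centred differences, as in Gx(x,y) = H(x+1,y) - H(x-1,y).
    gx = np.zeros_like(gray)
    gy = np.zeros_like(gray)
    gx[:, 1:-1] = gray[:, 2:] - gray[:, :-2]
    gy[1:-1, :] = gray[2:, :] - gray[:-2, :]
    mag = np.sqrt(gx ** 2 + gy ** 2)                      # gradient magnitude G
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0          # gradient direction alpha (unsigned)
    h, w = gray.shape
    ch, cw = h // cell, w // cell
    hist = np.zeros((ch, cw, bins))
    bin_idx = np.minimum((ang // (180.0 / bins)).astype(int), bins - 1)  # 20-degree ranges
    for i in range(ch):
        for j in range(cw):
            sl = (slice(i * cell, (i + 1) * cell), slice(j * cell, (j + 1) * cell))
            np.add.at(hist[i, j], bin_idx[sl].ravel(), mag[sl].ravel())
    return hist

def fuse_hog_and_conv(hog_map, conv_map):
    """Step 2-11): connect the HOG features and the convolution feature map in the third (channel) dimension."""
    assert hog_map.shape[:2] == conv_map.shape[:2]
    return np.concatenate([hog_map, conv_map], axis=2)
```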
The pooling-layer structure used in the traditional convolutional neural network does not take spatial relationships into account, which causes part of the valuable information in that layer to be lost. To solve this problem, the capsule network uses a digital capsule layer in place of the pooling layer and introduces the vector neuron, which stores information such as direction and position in vector form and passes it on through the network, so that the capsule network is sensitive to the position and direction information of elements in the image.
The improved capsule network based on the direction gradient histogram extracts features through a convolution layer and the HOG in parallel, and its image preprocessing part adopts HOG-C features. While retaining the advantages of the original capsule network, the introduction of the direction gradient histogram strengthens the extraction of edge feature information from the target detection image, and the gradient direction and magnitude increase the distinguishability between two objects that are too close to each other, which alleviates to a certain extent the problem that two objects of the same type cannot be detected.
Preferably, the step 3) of improving the capsule network algorithm specifically comprises the following steps:
3-1) extracting comprehensive characteristics by utilizing a parallel convolution network.
3-2) generating the feature vector by utilizing the redundancy removing capsule network.
3-3) restoring the original image with the deconvolution image reconstruction network and evaluating the network loss.
Preferably, the parallel convolution network algorithm in step 3-1) specifically comprises the following steps:
3-1-1) First, a parallel convolutional neural network is used as the feature extraction network. The parallel-fused image, of size 28 × 28 × 1, is used as input. The parallel convolutional neural network uses 4 convolution kernel sizes in its convolution layer, namely 3, 5, 7 and 9; the number of kernels of each size is 32 and the stride is 2.
3-1-2) Boundary filling. The padding size is adjusted to pad the boundary of the original matrix.
3-1-3) Feature extraction. The nonlinear function of the feature extraction layer is the PReLU function, with the mathematical formula:
PReLU(x) = max(0, x) + α * min(0, x)
where α is a learnable coefficient.
3-1-4) Feature tensor concatenation. The feature tensors are connected in the third dimension, giving a 14 × 14 × 128 feature tensor.
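As an illustration of step 3-1), the sketch below builds four parallel convolution branches with kernel sizes 3, 5, 7 and 9 (32 kernels each, stride 2) on the fused 28 × 28 × 1 input and concatenates their outputs in the channel dimension to obtain a 14 × 14 × 128 feature tensor. It is a minimal PyTorch sketch, not the patented implementation; the padding choice (k - 1) // 2 is an assumption used to keep all branches at 14 × 14.

```python
import torch
import torch.nn as nn

class ParallelConv(nn.Module):
    """Parallel convolution feature extractor (step 3-1), sketched."""
    def __init__(self, in_ch=1, out_ch=32, kernel_sizes=(3, 5, 7, 9)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                # padding=(k-1)//2 keeps every branch at 14x14 for a 28x28 input with stride 2
                nn.Conv2d(in_ch, out_ch, kernel_size=k, stride=2, padding=(k - 1) // 2),
                nn.PReLU(),   # PReLU(x) = max(0, x) + alpha * min(0, x), alpha learned
            )
            for k in kernel_sizes
        ])

    def forward(self, x):
        # Concatenate the four 32-channel branch outputs along the channel dimension.
        return torch.cat([branch(x) for branch in self.branches], dim=1)

feats = ParallelConv()(torch.randn(1, 1, 28, 28))   # -> torch.Size([1, 128, 14, 14])
```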
Preferably, the redundancy removing capsule network algorithm in the step 3-2) specifically comprises the following steps:
3-2-1) Input redundancy removal. The output of the parallel convolution network is used as the input of the redundancy-removing main capsule network, and redundant capsules are removed with a 1 × 1 convolution kernel, so that the 16 × 16 feature map obtained after feature extraction is converted into a 14 × 14 feature map and the number of capsules is reduced to 196.
3-2-2) Input the capsule u_i.
3-2-3) Multiply the input capsule vector by the transformation matrix to obtain the prediction vector û_(j|i); the mathematical formula is as follows:
û_(j|i) = W_ij · u_i
where W_ij is the transformation matrix.
3-2-4) Carry out a weighted summation of the vectors û_(j|i) with the coupling coefficients c_ij to obtain the weighted sum s_j; the mathematical formula is as follows:
s_j = Σ_i c_ij · û_(j|i)
3-2-5) Compress s_j with a nonlinear function and propagate it forward; the mathematical formula is as follows:
v_j = (‖s_j‖² / (1 + ‖s_j‖²)) · (s_j / ‖s_j‖)
where s_j denotes the weighted sum and v_j denotes its nonlinearly compressed output.
3-2-6) Update the coupling coefficients c_ij with the softmax function; the mathematical formula is as follows:
c_ij = exp(b_ij) / Σ_k exp(b_ik),  with  b_ij ← b_ij + û_(j|i) · v_j
where b_ij is the routing logit updated by the dynamic routing and is initialized to 0, v_j denotes the nonlinearly compressed output, and û_(j|i) is the vector obtained by multiplying the capsule vector by the transformation matrix.
3-2-7) Repeat the update of v_j; when the number of updates k reaches the number of capsules, the finally output v_j is the feature vector characterizing the j-th class; otherwise, return to step 3-2-2).
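As an illustration of steps 3-2-4) to 3-2-7), the sketch below implements the weighted summation, the nonlinear compression and the softmax update of the coupling coefficients. It is a minimal PyTorch sketch under assumptions not fixed by the description: a fixed number of routing iterations is used instead of iterating until the number of updates equals the number of capsules, and the output capsule count and dimension (10 and 16) are chosen only for illustration.

```python
import torch
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    """Step 3-2-5): v = (|s|^2 / (1 + |s|^2)) * s / |s|."""
    sq = (s ** 2).sum(dim=dim, keepdim=True)
    return (sq / (1.0 + sq)) * s / torch.sqrt(sq + eps)

def dynamic_routing(u_hat, iterations=3):
    """Steps 3-2-4) to 3-2-7): route prediction vectors u_hat = W_ij u_i
    of shape (batch, num_in_caps, num_out_caps, out_dim) to output capsules v_j."""
    b = torch.zeros(u_hat.shape[:3], device=u_hat.device)   # routing logits b_ij, initialized to 0
    for _ in range(iterations):
        c = F.softmax(b, dim=2)                              # coupling coefficients c_ij
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)             # weighted sum s_j
        v = squash(s)                                        # compressed output v_j
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)         # agreement update of b_ij
    return v

# 196 input capsules (the 14 x 14 map after 1 x 1 redundancy removal),
# routed to 10 output capsules of dimension 16.
u_hat = torch.randn(1, 196, 10, 16)
v = dynamic_routing(u_hat)    # -> torch.Size([1, 10, 16])
```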
Preferably, the deconvolution image reconstruction network algorithm in step 3-3) specifically comprises the following steps:
3-3-1) Image input. The input image size is 14 × 14, the feature input is 5 × 5, the convolution kernel sizes are 3, 5, 7 and 9, the number of kernels of each size is 32, the stride is 2, and the output image is made 28 × 28 by adjusting the padding. The output image size after the deconvolution is given by the following formula:
n_out = s · (n_in - 1) + k - 2p
where n_out denotes the output image size, s the stride, n_in the input image size, k the convolution kernel size, and p the padding, i.e. the fill size.
3-3-2) Splitting the capsule information. The 6272 neurons of the capsule layer are converted by a fully connected layer into a 14 × 14 × 32 tensor; this tensor is combined with the parallel convolution network outputs by connection in the third dimension, so that the 14 × 14 × 32 tensor becomes a 14 × 14 × 160 tensor.
3-3-3) Redundancy-removal deconvolution. Redundancy is removed from the 14 × 14 × 160 tensor with a 1 × 1 convolution kernel to obtain a 14 × 14 × 32 tensor, and a deconvolution operation with the kernels corresponding to the parallel convolution layer is performed, generating the final reconstructed image.
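As an illustration of steps 3-3-1) to 3-3-3), the sketch below applies the 1 × 1 redundancy-removing convolution and a transposed convolution that brings the 14 × 14 map back to 28 × 28, together with the output-size formula n_out = s(n_in - 1) + k - 2p. It is a minimal PyTorch sketch; the single-kernel branch, the padding (k - 1) // 2 and the output_padding of 1 are assumptions that stand in for "adjusting the padding" in the description.

```python
import torch
import torch.nn as nn

def deconv_out_size(n_in, k, s, p, output_padding=0):
    """Output size of a transposed convolution: n_out = s*(n_in - 1) + k - 2p (+ output_padding)."""
    return s * (n_in - 1) + k - 2 * p + output_padding

class ReconstructionHead(nn.Module):
    """Steps 3-3-2) and 3-3-3), sketched: 1x1 redundancy removal, then deconvolution."""
    def __init__(self, in_ch=160, mid_ch=32, k=3):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, mid_ch, kernel_size=1)      # 14x14x160 -> 14x14x32
        self.deconv = nn.ConvTranspose2d(mid_ch, 1, kernel_size=k, stride=2,
                                         padding=(k - 1) // 2, output_padding=1)

    def forward(self, x):
        return self.deconv(self.reduce(x))

img = ReconstructionHead()(torch.randn(1, 160, 14, 14))        # -> torch.Size([1, 1, 28, 28])
print(deconv_out_size(14, k=3, s=2, p=1, output_padding=1))    # 28
```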
Preferably, the step 4) of extracting the feature map of the center point of the detection frame and the feature map of the scale of the detection frame, forming the corresponding target frame and outputting the detection result specifically comprises the steps of:
4-1) Image input. The reconstructed image is used as the input of a convolution layer with kernel size 3 × 3 and 256 output channels.
4-2) Extracting the center feature map. The feature map of the detection frame center point is extracted with a 1 × 1 convolution kernel; the coordinates of object center points are marked as positive values and those of non-object points as negative values.
4-3) Extracting the scale feature map. The detection frame scale feature map is extracted with a parallel 1 × 1 convolution kernel; the scales in the scale feature map are the length and width of the detection frame.
4-4) Outputting the image. The center feature map and the scale feature map are fused to complete object target detection.
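As an illustration of steps 4-1) to 4-3), the sketch below passes the reconstructed image through a 3 × 3 convolution with 256 output channels and then through two parallel 1 × 1 convolutions, one producing the center-point feature map and one producing the detection frame scale (length and width) map. It is a minimal PyTorch sketch; the padding of the 3 × 3 layer, the ReLU and the channel counts of the two heads are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Step 4): 3x3x256 trunk convolution plus two parallel 1x1 heads, sketched."""
    def __init__(self, in_ch=1, num_classes=1):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_ch, 256, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.center = nn.Conv2d(256, num_classes, kernel_size=1)  # positive at object center points
        self.scale = nn.Conv2d(256, 2, kernel_size=1)             # detection frame length and width

    def forward(self, x):
        f = self.trunk(x)
        return self.center(f), self.scale(f)

center_map, scale_map = DetectionHead()(torch.randn(1, 1, 28, 28))
# center_map: torch.Size([1, 1, 28, 28]); scale_map: torch.Size([1, 2, 28, 28])
```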
The invention has the beneficial effects that:
the invention aims at the problem of insufficient extraction of original image information in the prior art, improves the capsule network structure, adds a parallel convolution network layer in front of a main capsule layer, and after the parallel convolution network extracts convolution characteristics, the convolution characteristic graph and the direction gradient histogram of an original image are directly fused in a dimension vector connection mode instead of a network mode, so that the input information of the network is richer and closer to the original image information.
The invention aims at the problems of redundancy and high algorithm complexity of the existing capsule structure, and performs redundancy removal operation on the main capsule. The redundancy-removing main capsule network simplifies the network structure, reduces the parameters compared with the original main capsule network, optimizes the time required by network training and improves the algorithm efficiency.
Drawings
FIG. 1 is a schematic diagram of the overall structure of a target detection model according to the present invention;
FIG. 2 is a flow chart of a parallel model of Histogram of Oriented Gradients (HOG) and a convolution feature map in accordance with the present invention;
FIG. 3 is a diagram of the improved capsule network architecture of the present invention;
FIG. 4 is a schematic diagram of a parallel convolutional network in the present invention;
FIG. 5 is a schematic diagram of a redundancy elimination capsule network according to the present invention;
FIG. 6 is a schematic diagram of a deconvolution image reconstruction network in accordance with the present invention;
FIG. 7 is an original image;
FIG. 8 is a schematic diagram of HOG histogram processing images;
FIG. 9 is a schematic diagram of a detection result of a pedestrian image by the method.
Detailed Description
The technical solution of the present invention is described in detail below, but the scope of the present invention is not limited to the embodiments.
As shown in fig. 1, a target detection method based on histogram of oriented gradients and improved capsule network mainly includes the following steps:
1) the target original image is obtained as shown in fig. 7.
2) As shown in fig. 2, the histogram of the directional gradient of the original image is fused in parallel with the convolution feature map to combine the edge contour feature of the image with the view feature of the convolution kernel, and then the result is used as the input of the improved capsule network;
2-1) normalization. Firstly, dividing a target image into 4 cells, dividing each cell into 9 blocks, carrying out Gamma normalization processing, and carrying out parameter optimization on a Gamma correction value; the normalization formula is as follows:
τ_ij′ = τ_ij / √(‖τ_ij‖² + ε²)
f = [τ_11′, τ_12′, …, τ_ij′, …]
where τ denotes the feature vector of a block, ε is a small constant, τ_ij denotes the i-th block in the j-th cell of the target image, and f denotes the target image after all block feature vectors have been normalized.
2-2) selecting a detection window from the normalized image; the detection window has the same aspect ratio as the image and does not exceed one half of the image size.
2-3) selecting blocks from the window; rectangular blocks of equal length and width are selected according to the detection window.
2-4) dividing cells within a block; within each rectangular block, square cells of 8 × 8 pixels are used as the minimum units of feature extraction within the block.
2-5) performing directional projection within the cell. The cell is divided into 9 directions, and direction information is extracted for each 20-degree angle range. The direction information is obtained by convolving the gray-level image I with the gradient template U in the horizontal x direction and the vertical y direction; the mathematical formulas are as follows:
G_x(x, y) = H(x+1, y) - H(x-1, y)
G_y(x, y) = H(x, y+1) - H(x, y-1)
G(x, y) = √(G_x(x, y)² + G_y(x, y)²)
α(x, y) = arctan(G_y(x, y) / G_x(x, y))
where H(x, y) denotes the gray value at the corresponding coordinate, G_x the horizontal gradient value, G_y the vertical gradient value, G the gradient magnitude, and α the gradient direction.
2-6) normalization within the cell. The number of gradient directions falling into each direction-angle range is counted within the cell to obtain a direction histogram, as shown in fig. 8, and the direction angle in which the directions are most concentrated is selected as the direction of the cell.
2-7) building HOG features within the block. The number of directions in each cell's direction-angle range is counted within the block to obtain a direction histogram, and the direction angle in which the directions are most concentrated is selected as the direction of the block.
2-8) if the last block has not been reached, return to step 2-3).
2-9) if the last window has not been reached, return to step 2-2); otherwise, the direction gradient histogram is obtained.
2-10) single-line convolution. The original image is input and features are extracted with a single-line convolution layer to obtain a convolution feature map.
2-11) feature splicing and parallel fusion. The convolution feature map is dimension-connected with the direction gradient histogram, and the two feature maps are connected in the third dimension to obtain a 28 × 28 × 1 image.
3) As shown in fig. 3 and 4, the improved capsule network extracts comprehensive features by using a parallel convolution network, forms feature vectors by using a redundancy-removing capsule network, and realizes image reconstruction by using a deconvolution image reconstruction network;
3-1) extracting comprehensive characteristics by using a parallel convolution network;
3-1-1) First, a parallel convolutional neural network is used as the feature extraction network. The parallel-fused image, of size 28 × 28 × 1, is used as input. The parallel convolutional neural network uses 4 convolution kernel sizes in its convolution layer, namely 3, 5, 7 and 9; the number of kernels of each size is 32 and the stride is 2.
3-1-2) Boundary filling. The padding size is adjusted to pad the boundary of the original matrix.
3-1-3) Feature extraction. The nonlinear function of the feature extraction layer is the PReLU function, with the mathematical formula:
PReLU(x) = max(0, x) + α * min(0, x)
where α is a learnable coefficient.
3-1-4) Feature tensor concatenation. The feature tensors are connected in the third dimension, giving a 14 × 14 × 128 feature tensor.
3-2) generating feature vectors using the redundancy removal capsule network as in FIG. 5;
3-2-1) Input redundancy removal. The output of the parallel convolution network is used as the input of the redundancy-removing main capsule network, and redundant capsules are removed with a 1 × 1 convolution kernel, so that the 16 × 16 feature map obtained after feature extraction is converted into a 14 × 14 feature map and the number of capsules is reduced to 196.
3-2-2) Input the capsule u_i.
3-2-3) Multiply the input capsule vector by the transformation matrix to obtain the prediction vector û_(j|i); the mathematical formula is as follows:
û_(j|i) = W_ij · u_i
where W_ij is the transformation matrix.
3-2-4) Carry out a weighted summation of the vectors û_(j|i) with the coupling coefficients c_ij to obtain the weighted sum s_j; the mathematical formula is as follows:
s_j = Σ_i c_ij · û_(j|i)
3-2-5) Compress s_j with a nonlinear function and propagate it forward; the mathematical formula is as follows:
v_j = (‖s_j‖² / (1 + ‖s_j‖²)) · (s_j / ‖s_j‖)
where s_j denotes the weighted sum and v_j denotes its nonlinearly compressed output.
3-2-6) Update the coupling coefficients c_ij with the softmax function; the mathematical formula is as follows:
c_ij = exp(b_ij) / Σ_k exp(b_ik),  with  b_ij ← b_ij + û_(j|i) · v_j
where b_ij is the routing logit updated by the dynamic routing and is initialized to 0, v_j denotes the nonlinearly compressed output, and û_(j|i) is the vector obtained by multiplying the capsule vector by the transformation matrix.
3-2-7) Repeat the update of v_j; when the number of updates k reaches the number of capsules, the finally output v_j is the feature vector characterizing the j-th class; otherwise, return to step 3-2-2).
3-3) as in FIG. 6, the original image is restored using the deconvolution image reconstruction network.
3-3-1) Image input. The input image size is 14 × 14, the feature input is 5 × 5, the convolution kernel sizes are 3, 5, 7 and 9, the number of kernels of each size is 32, the stride is 2, and the output image is made 28 × 28 by adjusting the padding. The output image size after the deconvolution is given by the following formula:
n_out = s · (n_in - 1) + k - 2p
where n_out denotes the output image size, s the stride, n_in the input image size, k the convolution kernel size, and p the padding, i.e. the fill size.
3-3-2) Splitting the capsule information. The 6272 neurons of the capsule layer are converted by a fully connected layer into a 14 × 14 × 32 tensor; this tensor is combined with the parallel convolution network outputs by connection in the third dimension, so that the 14 × 14 × 32 tensor becomes a 14 × 14 × 160 tensor.
3-3-3) Redundancy-removal deconvolution. Redundancy is removed from the 14 × 14 × 160 tensor with a 1 × 1 convolution kernel to obtain a 14 × 14 × 32 tensor, and a deconvolution operation with the kernels corresponding to the parallel convolution layer is performed, generating the final reconstructed image.
4) Extracting a feature map of the detection frame center point and a feature map of the detection frame scale with a 3 × 3 × 256 convolution layer and two parallel 1 × 1 convolution kernels, forming the corresponding target frames, and outputting the detection result.
4-1) Image input. The reconstructed image is used as the input of a convolution layer with kernel size 3 × 3 and 256 output channels.
4-2) Center feature map. The feature map of the detection frame center point is extracted with a 1 × 1 convolution kernel; the coordinates of object center points are marked as positive values and those of non-object points as negative values.
4-3) Scale feature map. The detection frame scale feature map is extracted with a parallel 1 × 1 convolution kernel; the scales in the scale feature map are the length and width of the detection frame.
4-4) Outputting the image. The center feature map and the scale feature map are fused to complete object target detection, as shown in fig. 9.
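Finally, forming the target frames from the two feature maps of steps 4-2) and 4-3) can be illustrated by the short decoding sketch below: positions with positive values in the center-point map are taken as object centers, and the length and width read from the scale map at those positions define the detection frames. The decoding routine itself is not spelled out in the description; the thresholding at zero and the function name decode_boxes are assumptions made for illustration.

```python
import numpy as np

def decode_boxes(center_map, scale_map, threshold=0.0):
    """Fuse the center-point map (H, W) and the scale map (H, W, 2) into detection frames."""
    boxes = []
    ys, xs = np.where(center_map > threshold)     # positive values mark object center points
    for y, x in zip(ys, xs):
        w, h = scale_map[y, x]                    # detection frame length and width
        boxes.append((float(x - w / 2.0), float(y - h / 2.0),
                      float(x + w / 2.0), float(y + h / 2.0)))
    return boxes

# Toy usage: one positive center at (x=10, y=12) with a 6 x 8 frame.
cm = np.full((28, 28), -1.0); cm[12, 10] = 1.0
sm = np.zeros((28, 28, 2)); sm[12, 10] = (6.0, 8.0)
print(decode_boxes(cm, sm))    # [(7.0, 8.0, 13.0, 16.0)]
```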

Claims (7)

1. A target detection method based on a direction gradient histogram and an improved capsule network is characterized by comprising the following steps:
1) obtaining a target original image, marking the target position by using a marking tool, and then randomly selecting different images as a training set;
2) fusing the direction gradient histogram of the original image with the convolution feature map in parallel, so as to combine the edge contour features of the image with the visual-field features of the convolution kernels, and using the result as the input of the improved capsule network;
3) the improved capsule network extracting comprehensive features with a parallel convolution network, forming feature vectors through a redundancy-removing capsule network, and realizing image reconstruction with a deconvolution image reconstruction network;
4) extracting a feature map of the detection frame center point and a feature map of the detection frame scale with a 3 × 3 × 256 convolution layer and two parallel 1 × 1 convolution kernels, forming the corresponding target frames, and outputting the detection result.
2. The histogram of oriented gradients and improved capsule network based object detection method of claim 1, wherein: the algorithm for parallel fusion of the direction gradient histogram and the convolution feature map in the step 2) comprises the following specific steps:
2-1) normalization; firstly, dividing a target image into 4 cells, dividing each cell into 9 blocks, carrying out Gamma normalization processing, and carrying out parameter optimization on a Gamma correction value; the normalization formula is as follows:
τ_ij′ = τ_ij / √(‖τ_ij‖² + ε²)
f = [τ_11′, τ_12′, …, τ_ij′, …]
in the formulas, τ denotes the feature vector of a block, ε is a small constant, τ_ij denotes the i-th block in the j-th cell of the target image, and f denotes the target image after the normalization of all block feature vectors is completed;
2-2) selecting a detection window from the normalized image, selecting a detection window having an aspect ratio equal to that of the image and not more than half the size of the image;
2-3) selecting blocks from the window, and selecting rectangular blocks with the same length and width according to the detection window;
2-4) dividing cell units in the block, and dividing the block by using square cells with 8 × 8 pixel size as minimum units for feature extraction in the block in a rectangular block;
2-5) performing directional projection in the cell, dividing 9 directions in the cell, extracting direction information in an angle range of every 20 degrees, wherein the direction information is obtained by performing convolution operation on the gray image I and the gradient template U in the x horizontal direction and the y vertical direction, and the mathematical formula is as follows:
G_x(x, y) = H(x+1, y) - H(x-1, y)
G_y(x, y) = H(x, y+1) - H(x, y-1)
G(x, y) = √(G_x(x, y)² + G_y(x, y)²)
α(x, y) = arctan(G_y(x, y) / G_x(x, y))
in the formulas, H(x, y) denotes the gray value at the corresponding coordinate, G_x denotes the horizontal gradient value, G_y denotes the vertical gradient value, G denotes the gradient magnitude, and α denotes the gradient direction;
2-6) carrying out normalization in the cell, counting the actual direction angle quantity of each direction angle range in the cell to obtain a direction histogram, and selecting the direction angle with the most concentrated angle direction as the direction of the cell;
2-7) constructing HOG characteristics in the block, counting the actual direction angle quantity of the direction angle range of each cell in the block to obtain a direction histogram, and selecting the direction angle with the most concentrated angle direction as the direction of the block;
2-8) if the last block is not reached, returning to the step 2-3);
2-9) if the last window is not reached, returning to the step 2-2), otherwise, obtaining a directional gradient histogram;
2-10) performing single-line convolution, inputting an original image, and performing feature extraction on the original image by using a single-line convolution layer to obtain a convolution feature map;
2-11) feature splicing and parallel fusion, performing dimension connection of the convolution feature map with the direction gradient histogram, and connecting the two feature maps in the third dimension to obtain a 28 × 28 × 1 image.
3. The histogram of oriented gradients and improved capsule network based object detection method of claim 1, wherein: the specific steps of improving the capsule network algorithm in the step 3) are as follows:
3-1) extracting comprehensive characteristics by using a parallel convolution network;
3-2) generating a feature vector by utilizing a redundancy removing capsule network;
and 3-3) restoring the original image by using a deconvolution image reconstruction network, and evaluating the network loss.
4. The histogram of oriented gradients and improved capsule network based object detection method of claim 3, wherein: the parallel convolution network algorithm in the step 3-1) comprises the following specific steps:
3-1-1) firstly, using a parallel convolutional neural network as the feature extraction network, using the parallel-fused image, of size 28 × 28 × 1, as input, wherein the parallel convolutional neural network uses 4 convolution kernel sizes in its convolution layer, namely 3, 5, 7 and 9, the number of kernels of each size is 32, and the stride is 2;
3-1-2) filling the boundary, adjusting the padding size to pad the boundary of the original matrix;
3-1-3) feature extraction, wherein the nonlinear function of the feature extraction layer is the PReLU function, with the mathematical formula:
PReLU(x) = max(0, x) + α * min(0, x)
in the formula, α is a learnable coefficient;
3-1-4) connecting the feature tensors in the third dimension to obtain a 14 × 14 × 128 feature tensor.
5. The histogram of oriented gradients and improved capsule network based object detection method of claim 3, wherein: the redundancy removing capsule network algorithm in the step 3-2) comprises the following specific steps:
3-2-1) inputting redundancy removal, adopting the output of the parallel convolution network as the input of a redundancy removal main capsule network, removing redundant capsules by using a 1 x 1 convolution kernel so that a 16 x 16 feature diagram after feature extraction is converted into 14 x 14 feature images, and reducing the number of capsules to 196;
3-2-2) inputting the capsule u_i;
3-2-3) multiplying the input capsule vector by the transformation matrix to obtain the prediction vector û_(j|i), with the mathematical formula:
û_(j|i) = W_ij · u_i
in the formula, W_ij is the transformation matrix;
3-2-4) carrying out a weighted summation of the vectors û_(j|i) with the coupling coefficients c_ij to obtain the weighted sum s_j, with the mathematical formula:
s_j = Σ_i c_ij · û_(j|i)
3-2-5) compressing s_j with a nonlinear function and propagating it forward, with the mathematical formula:
v_j = (‖s_j‖² / (1 + ‖s_j‖²)) · (s_j / ‖s_j‖)
in the formula, s_j denotes the weighted sum and v_j denotes its nonlinearly compressed output;
3-2-6) updating the coupling coefficients c_ij with the softmax function, with the mathematical formula:
c_ij = exp(b_ij) / Σ_k exp(b_ik),  with  b_ij ← b_ij + û_(j|i) · v_j
in the formula, b_ij is the routing logit updated by the dynamic routing and is initialized to 0, v_j denotes the nonlinearly compressed output, and û_(j|i) is the vector obtained by multiplying the capsule vector by the transformation matrix;
3-2-7) repeating the update of v_j; when the number of updates k reaches the number of capsules, the finally output v_j is the feature vector characterizing the j-th class; otherwise, returning to step 3-2-2).
6. The histogram of oriented gradients and improved capsule network based object detection method of claim 3, wherein: the deconvolution image reconstruction network algorithm in the step 3-3) comprises the following specific steps:
3-3-1) image input, wherein the input image size is 14 × 14, the feature input is 5 × 5, the convolution kernel sizes are 3, 5, 7 and 9, the number of kernels of each size is 32, the stride is 2, and the output image is made 28 × 28 by adjusting the padding, the output image size after the deconvolution being:
n_out = s · (n_in - 1) + k - 2p
in the formula, n_out denotes the output image size, s the stride, n_in the input image size, k the convolution kernel size, and p the padding, i.e. the fill size;
3-3-2) splitting the capsule information, wherein the 6272 neurons of the capsule layer are converted by a fully connected layer into a 14 × 14 × 32 tensor, and this tensor is combined with the parallel convolution network outputs by connection in the third dimension such that the 14 × 14 × 32 tensor becomes a 14 × 14 × 160 tensor;
3-3-3) redundancy-removal deconvolution, wherein redundancy is removed from the 14 × 14 × 160 tensor with a 1 × 1 convolution kernel to obtain a 14 × 14 × 32 tensor, and a deconvolution operation with the kernels corresponding to the parallel convolution layer is performed, thereby generating the final reconstructed image.
7. The histogram of oriented gradients and improved capsule network based object detection method of claim 1, wherein: the step 4) of extracting the feature map of the center point of the detection frame and the feature map of the scale of the detection frame to form the corresponding target frame and outputting the detection result specifically comprises the following steps:
4-1) image input, using the reconstructed image as the input of the convolution layer with the size of 3 x 3 and the output channel of 256;
4-2) extracting a central feature map, extracting a feature map of the central point of the detection frame by using a 1 x 1 convolution kernel, marking the coordinate of the central point of the object as a positive value, and marking the coordinate of the central point of the non-object as a negative value;
4-3) extracting a scale feature map, and extracting a detection frame scale feature map by using parallel 1-by-1 convolution kernels, wherein scales in the scale feature map are the length and width of a detection frame;
4-4) outputting an image, fusing the center feature map and the scale feature map to complete object target detection.
CN202210234245.5A 2022-03-10 2022-03-10 Target detection method based on direction gradient histogram and improved capsule network Pending CN114677558A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210234245.5A CN114677558A (en) 2022-03-10 2022-03-10 Target detection method based on direction gradient histogram and improved capsule network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210234245.5A CN114677558A (en) 2022-03-10 2022-03-10 Target detection method based on direction gradient histogram and improved capsule network

Publications (1)

Publication Number Publication Date
CN114677558A true CN114677558A (en) 2022-06-28

Family

ID=82073153

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210234245.5A Pending CN114677558A (en) 2022-03-10 2022-03-10 Target detection method based on direction gradient histogram and improved capsule network

Country Status (1)

Country Link
CN (1) CN114677558A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024065536A1 (en) * 2022-09-29 2024-04-04 Intel Corporation Methods and apparatus for image segmentation on small datasets
CN117953316A (en) * 2024-03-27 2024-04-30 湖北楚天龙实业有限公司 Image quality inspection method and system based on artificial intelligence


Similar Documents

Publication Publication Date Title
CN111461291B (en) Long-distance pipeline inspection method based on YOLOv3 pruning network and deep learning defogging model
CN110059698B (en) Semantic segmentation method and system based on edge dense reconstruction for street view understanding
CN110070091B (en) Semantic segmentation method and system based on dynamic interpolation reconstruction and used for street view understanding
CN111914838B (en) License plate recognition method based on text line recognition
CN113344806A (en) Image defogging method and system based on global feature fusion attention network
CN111507275B (en) Video data time sequence information extraction method and device based on deep learning
CN109753959B (en) Road traffic sign detection method based on self-adaptive multi-scale feature fusion
CN109886159B (en) Face detection method under non-limited condition
CN113160062B (en) Infrared image target detection method, device, equipment and storage medium
CN110969171A (en) Image classification model, method and application based on improved convolutional neural network
CN114022770A (en) Mountain crack detection method based on improved self-attention mechanism and transfer learning
CN113449691A (en) Human shape recognition system and method based on non-local attention mechanism
CN111612024A (en) Feature extraction method and device, electronic equipment and computer-readable storage medium
CN116052016A (en) Fine segmentation detection method for remote sensing image cloud and cloud shadow based on deep learning
CN111640116A (en) Aerial photography graph building segmentation method and device based on deep convolutional residual error network
CN113269224A (en) Scene image classification method, system and storage medium
CN117197763A (en) Road crack detection method and system based on cross attention guide feature alignment network
CN113298817A (en) High-accuracy semantic segmentation method for remote sensing image
CN115294356A (en) Target detection method based on wide area receptive field space attention
CN113378812A (en) Digital dial plate identification method based on Mask R-CNN and CRNN
CN115393928A (en) Face recognition method and device based on depth separable convolution and additive angle interval loss
CN115222754A (en) Mirror image segmentation method based on knowledge distillation and antagonistic learning
CN114677558A (en) Target detection method based on direction gradient histogram and improved capsule network
CN111027472A (en) Video identification method based on fusion of video optical flow and image space feature weight
CN114519819A (en) Remote sensing image target detection method based on global context awareness

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination