CN112509046B - Weak supervision convolutional neural network image target positioning method - Google Patents
- Publication number
- CN112509046B (application CN202011437759.8A)
- Authority
- CN
- China
- Prior art keywords
- image
- neural network
- layer
- convolutional neural
- map
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
- G06N5/041—Abduction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computational Linguistics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Computing Systems (AREA)
- Evolutionary Biology (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Health & Medical Sciences (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a weakly supervised convolutional neural network image target positioning method comprising the following steps: establishing a convolutional neural network classification model containing a batch normalization layer, training the model, and saving it after training; inputting the image to be positioned into the trained classification model and acquiring the feature map output by a deep convolutional layer; performing weighted fusion on the obtained feature maps to obtain a saliency map; converting the saliency map into a heat map and superimposing it on the input image to generate a composite image; and saving or visualizing the composite image to obtain the target positioning image. The method uses the batch normalization scaling factors as the weights of the corresponding feature maps, and thereby addresses the drawbacks of the prior art: complex implementation, dependence on information such as class confidence scores and gradients, and an opaque, poorly explained working mechanism of the convolutional neural network model.
Description
Technical Field
The invention relates to the field of image positioning, and in particular to a weakly supervised method for positioning image targets with a convolutional neural network.
Background
Convolutional neural networks first achieved breakthrough results in image classification and, owing to their outstanding feature extraction capability, have been widely applied in many fields, including image target positioning. When a convolutional neural network is applied to image classification, only a simple encoding of the image categories is required, whereas in a target positioning task the position of the target in the image must be annotated in advance with a manually drawn bounding box. The target positioning task therefore requires stronger supervision and is more challenging than the image classification task.
One existing method uses the weight parameters that the layer after global average pooling learns for the target category to perform a weighted sum of the feature maps output by the last convolutional layer, obtaining a class activation map; the class activation map highlights the position of the target in the image and is used for target positioning. Other researchers combined this with the feedback information of the convolutional neural network: through back propagation they compute the partial derivatives of the class score output by the network with respect to the feature maps of the last convolutional layer, use the global average pooling of these partial derivatives as the weights of the corresponding feature maps, and obtain a class activation map for localizing the image target. Building on this, further work takes a pixel-wise weighted average of the feature-map gradients, uses the resulting mean values as the weights of the corresponding feature maps, and performs target positioning with the resulting class activation map.
The prior art relies on a global average pooling layer and on the class confidence scores output by the convolutional neural network model; for models that do not contain a global average pooling layer, the structure must be modified to add one. Other prior methods compute the feature-map weights through back propagation, which introduces a large number of additional parameters, high computational complexity, and a complex implementation. Moreover, the inference process of a convolutional neural network classification model predicts the image target class by forward propagation alone, which is inconsistent with the backward information flow used by these methods, so the internal working mechanism of the convolutional neural network cannot be well explained.
Disclosure of Invention
Aiming at the above defects of the prior art, the weakly supervised convolutional neural network image target positioning method provided by the invention solves the problems that existing target positioning methods require bounding box annotations, back propagation computations, and model structure modifications, and are therefore complex to implement; it also addresses the problem that the opaque working mechanism of a convolutional neural network model cannot be well explained.
In order to achieve the purpose of the invention, the invention adopts the technical scheme that:
a weak supervision convolutional neural network image target positioning method comprises the following steps:
S1, establishing a convolutional neural network classification model with a batch normalization layer, training it, and saving it after training;
S2, inputting the image to be positioned into the classification model trained in step S1, and acquiring the feature map output by a deep convolutional layer;
S3, performing weighted fusion on the feature maps obtained in step S2 to obtain a saliency map;
S4, converting the saliency map obtained in step S3 into a heat map, and superimposing the heat map on the original input image to generate a composite image;
and S5, saving or visualizing the composite image obtained in step S4 to obtain the target positioning image.
The invention has the following beneficial effects. First, the method makes effective use of the feature information extracted by the convolutional neural network and can locate the target in an image without manually annotated position information, greatly reducing labeling cost. Second, target positioning requires only the forward propagation of the classification model, so the method has low complexity, is easy to implement, and is closer to the inference process of the neural network. Finally, it can to some extent explain the internal working mechanism of the convolutional neural network, helping people understand that mechanism more deeply.
Further, step S2 includes the following substeps:
s21, obtaining the characteristic diagram of the first layer output of the convolution layer from the input image I, and showing the characteristic diagram as follows:
Xl=H(I),I∈R3
wherein, XlA feature diagram representing the convolution output of the l layer, and H (-) represents the layers 1 to l of the convolution neural networkAn abstract function of convolution calculations;
s22, converting the feature diagram XlA batch normalization process was performed, expressed as:
wherein the content of the first and second substances,the characteristic diagram of the ith channel of the ith layer after batch normalization calculation is shown, the index i represents the channel index, BN (-) represents batch normalization calculation,scaling factor, μ, representing the batch normalization calculation for layer liMean, δ, representing the ith channel profileiRepresents the variance of the ith channel profile,representing the bias of batch normalization calculation of the l layer, wherein N represents the total channel number of the characteristic diagram of the l layer;
s23, carrying out nonlinear processing on the characteristic diagram by adopting a ReLU activation function to obtain an output characteristic diagram of the l-th layer, wherein the output characteristic diagram is represented as follows:
wherein, Fl(x, y) represents a deep convolutional layer output characteristic diagram, ReLU (phi) represents a ReLU activation function, max (phi) represents the maximum value of the ReLU activation function and the max (phi), and (x, y) represents the spatial pixel point coordinates of the characteristic diagram.
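As an illustrative sketch (not part of the claimed method), substeps S22 and S23 can be reproduced with NumPy on synthetic data; the shapes, the values of γ and β, and the small stabilizing constant eps are assumptions of this example:

```python
import numpy as np

rng = np.random.default_rng(0)
N, Hh, Ww = 4, 7, 7                      # N channels of an l-th layer feature map X^l
X = rng.normal(size=(N, Hh, Ww))

gamma = np.array([0.5, 1.0, 1.5, 2.0])   # BN scaling factors gamma_i^l (assumed values)
beta = np.zeros(N)                       # BN biases beta_i^l (assumed values)
eps = 1e-5                               # numerical-stability constant, not in the formula

# S22: per-channel batch normalization
mu = X.mean(axis=(1, 2), keepdims=True)      # mean mu_i of the i-th channel
delta = X.var(axis=(1, 2), keepdims=True)    # variance delta_i of the i-th channel
X_hat = gamma[:, None, None] * (X - mu) / np.sqrt(delta + eps) + beta[:, None, None]

# S23: ReLU non-linearity, F^l(x, y) = max(X_hat(x, y), 0)
F = np.maximum(X_hat, 0)
```

With β = 0 each normalized channel has (near-)zero mean, and the ReLU keeps only the non-negative responses; F is the form of feature map that step S3 fuses.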
The beneficial effects of the above further scheme are as follows: the target is located using only the feature maps obtained by the forward propagation of the convolutional neural network model, which fully exploits the feature extraction capability of the model and avoids the additional strongly supervised annotation information required by existing target positioning methods. Effective intermediate information produced during the inference of the model is used for positioning, which is more consistent with the operating mechanism of the model and helps to reasonably explain its internal mechanism.
Further, the specific process of step S3 is:
performing a weighted addition, along the channel direction, of the feature maps F_i^l(x, y) obtained in step S2 with the corresponding batch normalization scaling factors γ_i^l of the convolutional layer, to obtain a two-dimensional saliency map:
S(x, y) = Σ_{i=1}^{N} γ_i^l · F_i^l(x, y)
where S(x, y) denotes the saliency map;
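A minimal sketch of this weighted fusion, with stand-in values for the feature maps and scaling factors (both would come from the trained model in practice):

```python
import numpy as np

def bn_weighted_saliency(F, gamma):
    """S(x, y) = sum_i gamma_i^l * F_i^l(x, y) over the N channels."""
    return np.tensordot(gamma, F, axes=1)   # (N,) x (N, H, W) -> (H, W)

F = np.ones((3, 2, 2))
F[1] *= 2.0                        # the second channel is twice as active
gamma = np.array([0.1, 0.5, 0.2])  # assumed BN scaling factors
S = bn_weighted_saliency(F, gamma)
# each pixel: 0.1*1 + 0.5*2 + 0.2*1 = 1.3
```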
The beneficial effects of the above further scheme are as follows: the existing parameters of the model are used directly as the weights for the weighted fusion of the feature map channels; the model structure does not need to be changed, the class confidence scores predicted by the model are not needed, and the weights do not have to be obtained through large-scale computation, which improves the speed of target positioning.
Further, step S4 comprises the following substeps:
S41, filtering out the negative values in the saliency map S obtained in step S3 and enlarging it to the same size as the original image I using an interpolation algorithm:
R(x, y) = resize(max(S(x, y), 0))
where R(x, y) denotes the enlarged saliency map and resize(·) denotes the interpolation function;
S42, normalizing the values in the enlarged saliency map R(x, y):
R′(x, y) = (R(x, y) − min(R)) / (max(R) − min(R))
where R′(x, y) denotes the normalized saliency map, R′(x, y) ∈ [0, 1], and max(·) and min(·) denote the maximum and minimum functions respectively;
S43, converting the normalized saliency map R′(x, y) into a heat map and adding the heat map element by element to the original image to obtain the composite image:
M(x, y) = H(x, y) + I(x, y)
where M(x, y) denotes the composite image, H(x, y) denotes the heat map generated from R′(x, y), I(x, y) denotes the original input image, and I(x, y) ∈ [0, 1].
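An illustrative NumPy sketch of substeps S41 to S43; production code would typically use bilinear interpolation (e.g. cv2.resize) and a color map such as JET to build the heat map, while here a nearest-neighbour upscale and a grayscale heat channel keep the example dependency-free. All sizes and values are assumptions:

```python
import numpy as np

S = np.array([[-0.5, 2.0],
              [1.0, 0.0]])        # 2x2 saliency map from step S3
I = np.full((4, 4), 0.5)          # original image, values in [0, 1]

# S41: clip negatives, then enlarge to the input size (nearest-neighbour resize)
R = np.kron(np.maximum(S, 0), np.ones((2, 2)))

# S42: min-max normalize into [0, 1]
R_prime = (R - R.min()) / (R.max() - R.min())

# S43: treat the normalized map as the heat channel and add it to the image
H_map = R_prime                   # stand-in for a color-mapped heat map
M = H_map + I                     # composite image, elementwise sum
```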
The beneficial effects of the above further scheme are as follows: filtering out negative values reduces noise in the saliency map and effectively improves positioning accuracy; enlarging the saliency map to the size of the original image allows the features extracted by the model to be combined with the input image; normalization makes the target region in the saliency map more prominent and converts the saliency map into an image target localization mask; and combining the saliency map with the input image enlarges and highlights the pixel values of the target region, enabling effective localization.
Further, the specific process of step S5 is:
mapping the values in the composite image M into the interval [0, 255]:
L(x, y) = 255 · (M(x, y) − min(M)) / (max(M) − min(M))
where L(x, y) denotes the mapped image and L(x, y) ∈ [0, 255];
and finally, saving the mapped image L(x, y) in an image format or visualizing it to obtain the required target localization image.
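A short sketch of this mapping; the min-max rescaling below is one reasonable reading of "mapped into the [0, 255] interval" and is an assumption of this example:

```python
import numpy as np

def to_uint8(M):
    """Rescale a composite image into the displayable [0, 255] range."""
    L = 255.0 * (M - M.min()) / (M.max() - M.min())
    return L.astype(np.uint8)

M = np.array([[0.2, 1.5],
              [0.9, 0.5]])
L = to_uint8(M)    # ready for e.g. PIL.Image.fromarray(L).save("loc.png")
```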
The beneficial effects of the above further scheme are as follows: the composite localization image is mapped into a digital image format that a computer can recognize, making it convenient to display and observe the target positioning result. The visualized result also allows the internal operating mechanism of the model to be explained more intuitively and reasonably.
Drawings
FIG. 1 is a flow chart of a method for weakly supervised convolutional neural network image target localization of the present invention;
FIG. 2 is a flowchart of step S2 according to the present invention;
FIG. 3 is a flowchart of step S4 according to the present invention;
FIG. 4 is a schematic diagram of an implementation of the present invention.
Detailed Description
The following description of the embodiments of the present invention is provided to facilitate understanding by those skilled in the art. It should be understood, however, that the invention is not limited to the scope of these embodiments; to those skilled in the art, various changes are possible without departing from the spirit and scope of the invention as defined by the appended claims, and all products made using the inventive concept are protected.
As shown in fig. 1, a method for positioning an image target of a weakly supervised convolutional neural network includes the following steps:
s1, establishing a convolutional neural network classification model with a batch normalization layer, training the convolutional neural network classification model, and storing after training;
establishing a convolutional neural network classification model with a batch normalization layer, inputting a training set in an image classification data set into a convolutional neural network, training the classification model, enabling the model to extract deep features of a target in the image, and storing the trained convolutional neural network classification model;
S2, inputting the image to be positioned into the convolutional neural network classification model trained in step S1, and acquiring the feature map output by a deep convolutional layer;
the image to be positioned, which contains the target to be classified, is input into the classification model trained in S1, and the feature map output by the last convolutional layer is obtained by computation;
because input images differ in size, the spatial size of the feature map output by the last convolutional layer also differs. The spatial size of the feature map used in this step needs to be larger than 7 × 7; when it is too small, the target cannot be localized well. When the input image is small, a feature map output by an earlier convolutional layer may be used instead to ensure an appropriate spatial size.
In the embodiment of the present invention, step S2 shown in fig. 2 includes the following sub-steps:
S21, obtaining the feature map output by the l-th layer. Assuming the computation flow inside the l-th layer follows the order convolution → batch normalization → ReLU, the feature map output by the l-th convolution is obtained from the input image I as:
X^l = H(I),  I ∈ R^3
where X^l denotes the feature map output by the l-th convolution, and H(·) denotes the abstract function of the convolution computations of layers 1 to l of the convolutional neural network;
S22, performing batch normalization on the feature map X^l:
X̂_i^l = BN(X_i^l) = γ_i^l · (X_i^l − μ_i) / √δ_i + β_i^l,  i = 1, 2, …, N
where X̂_i^l denotes the i-th channel of the l-th layer feature map after batch normalization, the subscript i denotes the channel index, BN(·) denotes the batch normalization computation, γ_i^l denotes the scaling factor of the batch normalization of the l-th layer, μ_i and δ_i denote the mean and variance of the i-th channel feature map respectively, β_i^l denotes the bias of the batch normalization of the l-th layer, and N denotes the total number of channels of the l-th layer feature map. Each feature map channel is normalized by subtracting its mean and dividing by the square root of its variance, and the normalized channel is multiplied by the corresponding scaling factor γ_i^l, which can therefore represent the importance of the i-th feature map.
S23, applying the ReLU activation function to the feature map to obtain the output feature map of the l-th layer:
F^l(x, y) = ReLU(X̂^l(x, y)) = max(X̂^l(x, y), 0)
where F^l(x, y) denotes the deep convolutional layer output feature map sought by the invention, ReLU(·) denotes the ReLU activation function, max(·, ·) denotes taking the maximum of its two arguments, and (x, y) denote the spatial pixel coordinates of the feature map.
S3, performing weighted fusion on the feature maps obtained in step S2 to obtain a saliency map;
the scaling factors γ of the batch normalization layer in the deep convolutional layer of the model are read and used as the weights of the feature maps obtained in step S2, and the feature maps are weight-fused to obtain the saliency map;
in this embodiment of the present invention, the specific content of step S3 includes:
performing a weighted addition, along the channel direction, of the feature maps F_i^l(x, y) obtained in step S2 with the corresponding batch normalization scaling factors γ_i^l of the layer, to obtain a two-dimensional saliency map:
S(x, y) = Σ_{i=1}^{N} γ_i^l · F_i^l(x, y)
where S(x, y) denotes the saliency map;
the regions with larger values in the saliency map S correspond to the spatial position of the target in the original image.
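The patent stops at the composite image, but a bounding box can be read off the saliency map in the usual class-activation-map manner: threshold at a fraction of the maximum and take the extent of the surviving pixels. The threshold fraction 0.2 and the helper name are assumptions of this sketch:

```python
import numpy as np

def saliency_to_bbox(S, frac=0.2):
    """Bounding box (x1, y1, x2, y2) of pixels with S >= frac * max(S)."""
    mask = S >= frac * S.max()
    ys, xs = np.nonzero(mask)
    return xs.min(), ys.min(), xs.max(), ys.max()

S = np.zeros((8, 8))
S[2:5, 3:6] = 1.0                 # bright region = object location
bbox = saliency_to_bbox(S)        # -> (3, 2, 5, 4)
```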
S4, converting the saliency map obtained in step S3 into a heat map, and superimposing the heat map on the original input image to generate a composite image;
the saliency map is spatially enlarged to the size of the original image by an interpolation algorithm and converted into a heat map, which is superimposed on the original input image to generate the composite image; the composite image is then mapped into a suitable range and saved or visualized to obtain the target positioning image.
In the embodiment of the present invention, step S4 shown in fig. 3 includes the following sub-steps:
S41, filtering out the negative values in the saliency map S obtained in step S3 and enlarging it to the same size as the original image I using an interpolation algorithm:
R(x, y) = resize(max(S(x, y), 0))
where R(x, y) denotes the enlarged saliency map and resize(·) denotes the interpolation function, whose effect is to enlarge S(x, y) into R(x, y), the enlarged saliency map R(x, y) having the same size as the input original image I;
S42, normalizing the values in the enlarged saliency map R(x, y) for the heat map conversion:
R′(x, y) = (R(x, y) − min(R)) / (max(R) − min(R))
where R′(x, y) denotes the normalized saliency map, R′(x, y) ∈ [0, 1], and max(·) and min(·) denote the maximum and minimum functions respectively;
S43, converting the normalized saliency map R′(x, y) into a heat map and then adding the heat map element by element to the original image to obtain the composite image matrix:
M(x, y) = H(x, y) + I(x, y)
where M(x, y) denotes the composite image, H(x, y) denotes the heat map generated from R′(x, y), I(x, y) denotes the original input image, and I(x, y) ∈ [0, 1].
And S5, storing or visualizing the synthetic image obtained in the step S4 to obtain a target positioning image.
Because the pixel values of a digital image satisfy p ∈ [0, 255], the values in the composite image M(x, y) must be mapped into the interval [0, 255] before M(x, y) can be saved or displayed in an image format.
In the embodiment of the present invention, the specific content of step S5 includes:
mapping the values in the composite image M(x, y) into the interval [0, 255]:
L(x, y) = 255 · (M(x, y) − min(M)) / (max(M) − min(M))
where L(x, y) denotes the mapped image and L(x, y) ∈ [0, 255];
the mapped image L(x, y) is saved in an image format or visualized, finally yielding the image with the located target; the implementation process of the invention is shown in fig. 4.
It will be appreciated by those of ordinary skill in the art that the embodiments provided herein are intended to help the reader understand the principles of the invention, and that the scope of the invention is not limited to these specifically recited examples and embodiments. Those skilled in the art may make various other specific changes and combinations based on the teachings of the present invention without departing from its spirit, and such changes and combinations remain within the scope of the invention.
Claims (4)
1. A method for positioning an image target of a weakly supervised convolutional neural network is characterized by comprising the following steps:
s1, establishing a convolutional neural network classification model with a batch normalization layer, training the convolutional neural network classification model, and storing after training;
S2, inputting the image to be positioned into the convolutional neural network classification model trained in step S1, and acquiring the feature map output by a deep convolutional layer;
S3, performing weighted fusion on the feature maps obtained in step S2 to obtain a saliency map, specifically:
performing a weighted addition, along the channel direction, of the feature maps F_i^l(x, y) obtained in step S2 with the corresponding batch normalization scaling factors γ_i^l of the convolutional layer, to obtain a two-dimensional saliency map:
S(x, y) = Σ_{i=1}^{N} γ_i^l · F_i^l(x, y)
where S(x, y) denotes the saliency map, the subscript i denotes the channel index, and N denotes the total number of channels of the l-th layer feature map;
S4, converting the saliency map obtained in step S3 into a heat map, and superimposing the heat map on the input image to generate a composite image;
and S5, storing or visualizing the synthetic image obtained in the step S4 to obtain a target positioning image.
2. The weakly supervised convolutional neural network image target localization method of claim 1, wherein the step S2 includes the following substeps:
S21, obtaining the feature map output by the l-th convolutional layer from the input image I:
X^l = H(I),  I ∈ R^3
where X^l denotes the feature map output by the l-th convolution, and H(·) denotes the abstract function of the convolution computations of layers 1 to l of the convolutional neural network;
S22, performing batch normalization on the feature map X^l:
X̂_i^l = BN(X_i^l) = γ_i^l · (X_i^l − μ_i) / √δ_i + β_i^l,  i = 1, 2, …, N
where X̂_i^l denotes the i-th channel of the l-th layer feature map after batch normalization, the subscript i denotes the channel index, BN(·) denotes the batch normalization computation, γ_i^l denotes the scaling factor of the batch normalization of the l-th layer, μ_i and δ_i denote the mean and variance of the i-th channel feature map respectively, β_i^l denotes the bias of the batch normalization of the l-th layer, and N denotes the total number of channels of the l-th layer feature map;
S23, applying the ReLU activation function to the feature map to obtain the output feature map of the l-th layer:
F^l(x, y) = ReLU(X̂^l(x, y)) = max(X̂^l(x, y), 0)
where F^l(x, y) denotes the deep convolutional layer output feature map, ReLU(·) denotes the ReLU activation function, max(·, ·) denotes taking the maximum of its two arguments, and (x, y) denote the spatial pixel coordinates of the feature map.
3. The weakly supervised convolutional neural network image object localization method of claim 1, wherein the step S4 includes the following substeps:
S41, filtering out the negative values in the saliency map S obtained in step S3 and enlarging it to the same size as the original image I using an interpolation algorithm:
R(x, y) = resize(max(S(x, y), 0))
where R(x, y) denotes the enlarged saliency map and resize(·) denotes the interpolation function;
S42, normalizing the values in the enlarged saliency map R(x, y):
R′(x, y) = (R(x, y) − min(R)) / (max(R) − min(R))
where R′(x, y) denotes the normalized saliency map, R′(x, y) ∈ [0, 1], and max(·) and min(·) denote the maximum and minimum functions respectively;
S43, converting the normalized saliency map R′(x, y) into a heat map and adding the heat map element by element to the original image to obtain the composite image:
M(x, y) = H(x, y) + I(x, y)
where M(x, y) denotes the composite image, H(x, y) denotes the heat map generated from R′(x, y), I(x, y) denotes the original input image, and I(x, y) ∈ [0, 1].
4. The weakly supervised convolutional neural network image target localization method as claimed in claim 3, wherein the specific process in the step S5 includes:
mapping the values in the composite image M(x, y) into the interval [0, 255]:
L(x, y) = 255 · (M(x, y) − min(M)) / (max(M) − min(M))
where L(x, y) denotes the mapped image and L(x, y) ∈ [0, 255];
and finally, saving the mapped image L(x, y) in an image format or visualizing it to obtain the required target localization image.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011437759.8A CN112509046B (en) | 2020-12-10 | 2020-12-10 | Weak supervision convolutional neural network image target positioning method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011437759.8A CN112509046B (en) | 2020-12-10 | 2020-12-10 | Weak supervision convolutional neural network image target positioning method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112509046A CN112509046A (en) | 2021-03-16 |
CN112509046B true CN112509046B (en) | 2021-09-21 |
Family
ID=74970670
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011437759.8A Active CN112509046B (en) | 2020-12-10 | 2020-12-10 | Weak supervision convolutional neural network image target positioning method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112509046B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20230153383A1 (en) * | 2021-11-18 | 2023-05-18 | International Business Machines Corporation | Data augmentation for machine learning |
CN115082657A (en) * | 2022-04-14 | 2022-09-20 | 华南理工大学 | Soft erasure-based weak supervision target positioning algorithm |
CN114750164B (en) * | 2022-05-25 | 2023-06-20 | 清华大学深圳国际研究生院 | Transparent object grabbing method, transparent object grabbing system and computer readable storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110570410A (en) * | 2019-09-05 | 2019-12-13 | 河北工业大学 | Detection method for automatically identifying and detecting weld defects |
CN111611999A (en) * | 2020-05-22 | 2020-09-01 | 福建师范大学 | Saliency detection method and terminal fusing small-size depth generation model |
CN111882560A (en) * | 2020-06-16 | 2020-11-03 | 北京工业大学 | Lung parenchymal CT image segmentation method based on weighted full-convolution neural network |
CN112036288A (en) * | 2020-08-27 | 2020-12-04 | 华中师范大学 | Facial expression recognition method based on cross-connection multi-feature fusion convolutional neural network |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10902252B2 (en) * | 2017-07-17 | 2021-01-26 | Open Text Corporation | Systems and methods for image based content capture and extraction utilizing deep learning neural network and bounding box detection training techniques |
US10922573B2 (en) * | 2018-10-22 | 2021-02-16 | Future Health Works Ltd. | Computer based object detection within a video or image |
CN110610475B (en) * | 2019-07-07 | 2021-09-03 | 河北工业大学 | Visual defect detection method of deep convolutional neural network |
CN111008633B (en) * | 2019-10-17 | 2023-03-10 | 安徽清新互联信息科技有限公司 | License plate character segmentation method based on attention mechanism |
CN111046964B (en) * | 2019-12-18 | 2021-01-26 | 电子科技大学 | Convolutional neural network-based human and vehicle infrared thermal image identification method |
CN111563418A (en) * | 2020-04-14 | 2020-08-21 | 浙江科技学院 | Asymmetric multi-mode fusion significance detection method based on attention mechanism |
2020-12-10: Application CN202011437759.8A filed in China; granted as CN112509046B (status: active)
Also Published As
Publication number | Publication date |
---|---|
CN112509046A (en) | 2021-03-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112509046B (en) | Weak supervision convolutional neural network image target positioning method | |
CN110287849B (en) | Lightweight depth network image target detection method suitable for raspberry pi | |
WO2021036059A1 (en) | Image conversion model training method, heterogeneous face recognition method, device and apparatus | |
CN108734723B (en) | Relevant filtering target tracking method based on adaptive weight joint learning | |
CN112132058B (en) | Head posture estimation method, implementation system thereof and storage medium | |
CN112801182B (en) | RGBT target tracking method based on difficult sample perception | |
CN111414954B (en) | Rock image retrieval method and system | |
CN113554032B (en) | Remote sensing image segmentation method based on multi-path parallel network of high perception | |
CN116152254B (en) | Industrial leakage target gas detection model training method, detection method and electronic equipment | |
CN114612660A (en) | Three-dimensional modeling method based on multi-feature fusion point cloud segmentation | |
CN117252904B (en) | Target tracking method and system based on long-range space perception and channel enhancement | |
CN111368637B (en) | Transfer robot target identification method based on multi-mask convolutional neural network | |
CN115311463A (en) | Category-guided multi-scale decoupling marine remote sensing image text retrieval method and system | |
JP2020127194A (en) | Computer system and program | |
CN114820655A (en) | Weak supervision building segmentation method taking reliable area as attention mechanism supervision | |
CN113850136A (en) | Yolov5 and BCNN-based vehicle orientation identification method and system | |
CN115546076A (en) | Remote sensing image thin cloud removing method based on convolutional network | |
CN112396036A (en) | Method for re-identifying blocked pedestrians by combining space transformation network and multi-scale feature extraction | |
CN108764287B (en) | Target detection method and system based on deep learning and packet convolution | |
CN112668662B (en) | Outdoor mountain forest environment target detection method based on improved YOLOv3 network | |
CN112750071B (en) | User-defined expression making method and system | |
CN115661694B (en) | Intelligent detection method and system for light-weight main transformer with focusing key characteristics, storage medium and electronic equipment | |
CN116109682A (en) | Image registration method based on image diffusion characteristics | |
CN111914110A (en) | Example retrieval method based on deep activation salient region | |
CN116797847A (en) | Enhanced complementary fine-grained image classification network system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||