CN116433675B - Vehicle counting method based on residual information enhancement, electronic device and readable medium

Vehicle counting method based on residual information enhancement, electronic device and readable medium

Info

Publication number
CN116433675B
Authority
CN
China
Prior art keywords
layer
feature map
output
remote sensing
trained
Prior art date
Legal status
Active
Application number
CN202310711220.4A
Other languages
Chinese (zh)
Other versions
CN116433675A (en)
Inventor
熊盛武
严凯
杨锴
潘晟凯
陈亚雄
Current Assignee
Sanya Science and Education Innovation Park of Wuhan University of Technology
Original Assignee
Sanya Science and Education Innovation Park of Wuhan University of Technology
Priority date
Filing date
Publication date
Application filed by Sanya Science and Education Innovation Park of Wuhan University of Technology
Priority to CN202310711220.4A
Publication of CN116433675A
Application granted
Publication of CN116433675B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G06T 7/0002 — Image analysis; inspection of images, e.g. flaw detection
    • G06N 3/0464 — Computing arrangements based on biological models; neural networks; convolutional networks [CNN, ConvNet]
    • G06V 10/774 — Image or video recognition using machine learning; generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/806 — Image or video recognition using machine learning; fusion of extracted features
    • G06V 10/82 — Image or video recognition using neural networks
    • G06T 2207/10032 — Image acquisition modality; satellite or aerial image; remote sensing
    • G06T 2207/20081 — Special algorithmic details; training; learning
    • G06T 2207/20084 — Special algorithmic details; artificial neural networks [ANN]
    • G06T 2207/30242 — Subject of image; counting objects in image
    • G06V 2201/08 — Detecting or categorising vehicles
    • Y02T 10/40 — Climate change mitigation technologies related to transportation; engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a vehicle counting method based on residual information enhancement, which comprises the following steps: for each remote sensing image to be trained in a remote sensing image training set, a first feature map is obtained through a feature extraction network VGG19 module; features are extracted from the first feature map by different convolutional neural networks and spliced to obtain a second feature map; a pooling query tensor Q, a pooling key tensor K and a pooling value tensor V are determined through the operation layers of a spatial attention module, and information enhancement processing is performed on the basis of the self-attention mechanism of the spatial attention module; the third feature map obtained after the information enhancement processing is fused with the second feature map to obtain a prediction density map; and a vehicle counting model is trained with the prediction density maps and the real density maps to obtain a trained vehicle counting model.

Description

Vehicle counting method based on residual information enhancement, electronic device and readable medium
Technical Field
The present invention relates to the field of traffic control systems and image recognition technologies, and in particular, to a vehicle counting method, an electronic device, and a readable medium based on residual information enhancement.
Background
Vehicle counting refers to a process of counting vehicles passing through a certain place for a certain period of time. It has important significance for urban traffic management and planning. First, the vehicle count may provide useful information to urban traffic planners in determining the location and size of future roads, bridges, and traffic junctions. By knowing the flow and congestion of different types of vehicles in different time periods, the road and mass transit systems can be optimized to better accommodate increasing urban traffic demands. In addition, through the analysis of the vehicle counting data, the traffic flow can be better known, and targeted traffic control measures such as synchronization of signal lamps, lane adjustment and the like are formulated so as to better manage urban traffic and reduce traffic jams and driving accidents.
Traditional manual counting of vehicles is extremely time consuming, cumbersome and error prone. Nowadays, remote sensing images play an important role in modern earth observation. A remote sensing image, or remote sensing photograph, is the product of information obtained by various sensors and serves as the information carrier of a remotely sensed target; like photographs taken in daily life, it allows a large amount of useful information to be extracted. Because the remote sensing platform overlooks the ground from high altitude, the traffic conditions of an entire city can be judged over a large range and a whole area, and the overall traffic condition information (namely image data) of the city can be obtained at one time, which is very beneficial for supporting traffic guidance decisions from a global perspective. The rapid development of the combination of computer vision and remote sensing imagery provides an approach for counting vehicles efficiently.
A known counting method based on remote sensing images is density-map-based object counting. For example, CSRNet (see Y. Li, X. Zhang, and D. Chen, "CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes," CVPR 2018: 1091-1100) extracts vehicle features of different scales from the remote sensing image by combining dilated convolution with multi-column convolution fusion to realize vehicle counting. However, the multi-column convolution fusion method has been shown to produce similar results in each column, with little difference in effect and large redundancy, so the quality of the extracted vehicle features is poor and the vehicle counting accuracy is consequently not high.
Disclosure of Invention
In view of the above, an object of the present invention is to provide an improved vehicle counting method, electronic device and readable medium, which can accurately extract vehicle characteristics of remote sensing pictures and improve accuracy of vehicle counting in remote sensing scenes.
In order to achieve the above object, the present invention provides a vehicle counting method based on residual information enhancement, comprising:
s100, acquiring a plurality of remote sensing images to be trained, and adjusting each remote sensing image to be trained to be of a uniform size to obtain a remote sensing image training set;
s200, determining a predicted density map corresponding to each remote sensing image to be trained in a remote sensing image training set, and training a vehicle counting model according to the predicted density map and the real density map corresponding to each remote sensing image to be trained to obtain a trained vehicle counting model;
S300, after receiving a target area vehicle counting request, acquiring a target remote sensing image corresponding to the target area vehicle counting request, and determining the number of vehicles in the target area based on the trained vehicle counting model,
wherein in step S200, determining the predicted density map corresponding to each remote sensing image to be trained in the remote sensing image training set specifically includes:
s210, performing feature extraction on a remote sensing image to be trained by using a feature extraction network VGG19 module to obtain a first feature map F1;
s220, independently extracting features of the first feature map F1 by using different convolutional neural networks to obtain a global feature map and a local feature map, and splicing the global feature map and the local feature map to obtain a second feature map F2;
S230, inputting the first feature map F1 into three independent operation layers in a spatial attention module to determine a pooling query tensor Q, a pooling key tensor K and a pooling value tensor V according to a convolution feature map output by each operation layer, wherein each operation layer comprises a convolution layer conv4, a linear rectification layer relu4, a batch normalization layer bn4 and an addition layer add4, the input of the convolution layer conv4 is the first feature map F1, one output end of the convolution layer conv4 is connected to the input end of the linear rectification layer relu4, the output end of the linear rectification layer relu4 is connected to the input end of the batch normalization layer bn4, and the output end of the batch normalization layer bn4 is connected to the input end of the addition layer add4 together with the other output end of the convolution layer conv4;
S240, based on a self-attention mechanism of the spatial attention module, performing information enhancement processing on the first feature map F1 to obtain a third feature map F3;
S250, processing the third feature map F3 of the remote sensing image to be trained through a third convolutional neural network to obtain a fourth feature map F4, multiplying the fourth feature map F4 with the second feature map F2 channel by channel to obtain a fifth feature map F5, and adding and fusing the fifth feature map F5 and the second feature map F2 channel by channel to obtain a fused feature map;
and S260, determining a corresponding prediction density map according to the fusion feature map of the remote sensing image to be trained.
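As a non-limiting illustration only, the overall flow of steps S210 to S260 can be sketched in PyTorch as follows; the module names (placeholders passed in at construction) and the exact wiring are assumptions of this sketch, and the individual networks are detailed in the preferred embodiments below.

```python
import torch
import torch.nn as nn

class VehicleCounter(nn.Module):
    """Hypothetical end-to-end sketch of steps S210-S260."""
    def __init__(self, frontend, global_branch, local_branch, attention, fusion_head):
        super().__init__()
        self.frontend = frontend            # S210: VGG19-based feature extractor -> F1
        self.global_branch = global_branch  # S220: first convolutional neural network
        self.local_branch = local_branch    # S220: second convolutional neural network
        self.attention = attention          # S230/S240: spatial attention with residual operation layers
        self.fusion_head = fusion_head      # S250/S260: third CNN + channel-wise fusion -> density map

    def forward(self, image):
        f1 = self.frontend(image)                        # first feature map F1
        f2 = torch.cat([self.global_branch(f1),
                        self.local_branch(f1)], dim=1)   # second feature map F2 (S220)
        f3 = self.attention(f1)                          # third feature map F3 (S230/S240)
        density = self.fusion_head(f3, f2)               # fused feature map -> predicted density map
        return density
```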
Preferably, in each of the operation layers of step S230,
for the convolution layer conv4, the window size of the convolution kernel is 3*3 and the sliding step length is 1, and it is used to output 512 feature maps;
for the linear rectification layer relu4, the linear rectification formula is:
f(x) = max(0, x)
wherein x is each element of the input feature map of the linear rectification layer, and f(·) is the linear rectification function;
for the batch normalization layer bn4, the batch normalization formulas are:
μ = (1/N) * Σ_{i=1..N} x_i
σ = sqrt( (1/N) * Σ_{i=1..N} (x_i − μ)² )
x'_i = (x_i − μ) / sqrt(σ² + ε)
y_i = γ * x'_i + β
wherein N is the number of all elements in the input feature map of the batch normalization layer; x_i is the i-th element of the input feature map of the batch normalization layer, with 1 ≤ i ≤ N; μ is the mean of x_i; σ is the standard deviation of x_i; ε is a positive constant close to 0; x'_i is the standard score of x_i; γ and β are the scaling parameter and the offset parameter respectively, the initial value of the scaling parameter γ being 1 and the initial value of the offset parameter β being 0; y_i is the i-th element of the output feature map of the batch normalization layer;
for the addition layer add4, it adds the output of the batch normalization layer bn4 and the output of the convolution layer conv4, and outputs the corresponding feature map.
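As a hedged illustration of one operation layer under the parameters given above (3*3 convolution, stride 1, 512 output channels), the following PyTorch sketch may be considered; the padding of 1, chosen so that the spatial size of F1 is preserved, is an assumption of the sketch and not a limitation of the embodiment.

```python
import torch.nn as nn

class OperationLayer(nn.Module):
    """Sketch of one operation layer: conv4 -> relu4 -> bn4, then add4 merges
    the normalized branch back onto the raw convolution output (residual path)."""
    def __init__(self, in_channels=512, out_channels=512):
        super().__init__()
        self.conv4 = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1, padding=1)
        self.relu4 = nn.ReLU(inplace=False)
        self.bn4 = nn.BatchNorm2d(out_channels)  # gamma initialized to 1, beta to 0 by default

    def forward(self, f1):
        m = self.conv4(f1)   # feature M
        p = self.relu4(m)    # feature P
        n = self.bn4(p)      # feature N, confined to a small interval around zero
        return m + n         # add4: fuse M with the residual information N
```

Three such layers applied in parallel to F1 would yield the three convolution feature maps from which the pooled query, key and value tensors Q, K and V are subsequently derived.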
Preferably, step S250 specifically includes:
s251, processing the third feature map F3 by using a third convolutional neural network to obtain a fourth feature map F4, wherein the third convolutional neural network comprises a first convolutional layer conv5_1, a first linear rectifying layer relu5-1, a first batch of normalizing layers bn5-1, a second convolutional layer conv5_2, a second linear rectifying layer relu5-2, a second batch of normalizing layers bn5-2 and a third convolutional layer conv5_3 which are sequentially connected;
S252, multiplying the fourth feature map F4 and the second feature map F2 channel by channel to obtain a fifth feature map F5, and adding the fifth feature map F5 and the second feature map F2 channel by channel to obtain the fused feature map.
Further preferably, in the third convolutional neural network,
for the first convolution layer conv5_1, the window size of the convolution kernel is 3*3 and the sliding step length is 1, and it is used to output 256 feature maps;
for the second convolution layer conv5_2, the window size of the convolution kernel is 3*3 and the sliding step length is 1, and it is used to output 128 feature maps;
for the third convolution layer conv5_3, the window size of the convolution kernel is 1*1 and the sliding step length is 1, and it is used to output 1 feature map;
for the first linear rectifying layer relu5-1 and the second linear rectifying layer relu5-2, the linear rectifying processing formula is the same as the linear rectifying processing formula of the linear rectifying layer relu4 of the operation layer;
for the first batch normalization layer bn5-1 and the second batch normalization layer bn5-2, the batch normalization formula is the same as the batch normalization formula of the batch normalization layer bn4 of the operation layer, the initial value of the scaling parameter γ is 1, and the initial value of the offset parameter β is 0.
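Under the layer sizes listed above, step S250 could be sketched as follows; the padding values are chosen here only so that spatial sizes match for the channel-by-channel operations, which the embodiment does not prescribe.

```python
import torch.nn as nn

class ResidualFusion(nn.Module):
    """Sketch of S251/S252: the third CNN turns F3 into a single-channel weight map F4,
    which re-weights F2 (F5 = F4 * F2) before the residual addition F5 + F2."""
    def __init__(self, in_channels=512):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, stride=1, padding=1),  # conv5_1
            nn.ReLU(),                                            # relu5-1
            nn.BatchNorm2d(256),                                   # bn5-1
            nn.Conv2d(256, 128, 3, stride=1, padding=1),           # conv5_2
            nn.ReLU(),                                             # relu5-2
            nn.BatchNorm2d(128),                                    # bn5-2
            nn.Conv2d(128, 1, 1, stride=1),                         # conv5_3
        )

    def forward(self, f3, f2):
        f4 = self.branch(f3)  # fourth feature map F4, one channel
        f5 = f4 * f2          # fifth feature map F5: channel-by-channel re-weighting (broadcast)
        return f5 + f2        # fused feature map F4*F2 + F2
```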
Preferably, step S220 specifically includes:
s221, performing feature extraction on the first feature map F1 by using a first convolutional neural network to obtain a global feature map, wherein the first convolutional neural network comprises a first convolutional layer conv2_1, a first batch of normalization layers bn2-1, a first linear rectifying layer relu2-1, a second convolutional layer conv2_2, a third convolutional layer conv2_3, a fourth convolutional layer conv2_4, a fifth convolutional layer conv2_5, a second batch of normalization layers bn2-2, a second linear rectifying layer relu2-2, a sixth convolutional layer conv2_6, a third batch of normalization layers bn2-3 and a third linear rectifying layer relu2-3 which are sequentially connected;
S222, performing feature extraction on the first feature map F1 by using a second convolutional neural network to obtain a local feature map, wherein the second convolutional neural network comprises an adaptive average pooling layer adp_avg_pooling3, a first convolutional layer conv3_1, a first batch of normalization layers bn3-1, a first linear rectification layer relu3-1, a second convolutional layer conv3_2, a third convolutional layer conv3_3, a fourth convolutional layer conv3_4, a fifth convolutional layer conv3_5, a second batch of normalization layers bn3-2, a second linear rectification layer relu3-2, a sixth convolutional layer conv3_6, a third batch of normalization layers bn3-3, a third linear rectification layer relu3-3 and an up-sampling layer interpolate3 which are sequentially connected;
s223, calling a cat function to splice the global feature and the local feature in the channel dimension to obtain a second feature map F2.
Further, in the first convolutional neural network,
for the first convolution layer conv2_1, the window size of the convolution kernel is 1*1 and the sliding step length is 1, and it is used to output 128 feature maps;
for the second convolution layer conv2_2, the window size of the convolution kernel is 3*3 and the sliding step length is 1, and it is used to output 128 feature maps;
for the third convolution layer conv2_3, the window size of the convolution kernel is 5*5 and the sliding step length is 1, and it is used to output 128 feature maps;
for the fourth convolution layer conv2_4, the window size of the convolution kernel is 3*3 and the sliding step length is 1, and it is used to output 128 feature maps;
for the fifth convolution layer conv2_5, the window size of the convolution kernel is 5*5 and the sliding step length is 1, and it is used to output 128 feature maps;
for the sixth convolution layer conv2_6, the window size of the convolution kernel is 1*1 and the sliding step length is 1, and it is used to output 128 feature maps;
for the first batch normalization layer bn2-1, the second batch normalization layer bn2-2 and the third batch normalization layer bn2-3, the batch normalization formula is the same as the batch normalization formula of the batch normalization layer bn4 of the operation layer, the initial value of the scaling parameter gamma is 1, and the initial value of the offset parameter beta is 0;
for the first linear rectifying layer relu2-1, the second linear rectifying layer relu2-2 and the third linear rectifying layer relu2-3, the linear rectifying processing formula is the same as the linear rectifying processing formula of the linear rectifying layer relu4 of the operation layer.
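Reading the first convolutional neural network literally as a sequential stack, a hedged sketch is given below; the padding values are assumptions made so that the branch preserves the 128 by 128 spatial size of F1.

```python
import torch.nn as nn

def build_global_branch(in_channels=512):
    """Sketch of the first convolutional neural network (global feature map)."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 128, 1, stride=1),     # conv2_1
        nn.BatchNorm2d(128), nn.ReLU(),               # bn2-1, relu2-1
        nn.Conv2d(128, 128, 3, stride=1, padding=1),  # conv2_2
        nn.Conv2d(128, 128, 5, stride=1, padding=2),  # conv2_3
        nn.Conv2d(128, 128, 3, stride=1, padding=1),  # conv2_4
        nn.Conv2d(128, 128, 5, stride=1, padding=2),  # conv2_5
        nn.BatchNorm2d(128), nn.ReLU(),               # bn2-2, relu2-2
        nn.Conv2d(128, 128, 1, stride=1),             # conv2_6
        nn.BatchNorm2d(128), nn.ReLU(),               # bn2-3, relu2-3
    )
```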
Further, in the second convolutional neural network,
for the adaptive average pooling layer adp_avg_pooling3, the adaptive average pooling formula is:
Output[i, j, c] = 1/(pool_H * pool_W) * Σ_{m=(i−1)*pool_H+1..i*pool_H} Σ_{n=(j−1)*pool_W+1..j*pool_W} Input[m, n, c]
wherein Output[i, j, c] represents the value at row number i, column number j and channel number c of the output feature map of the adaptive average pooling layer; Input[m, n, c] represents the value at row number m, column number n and channel number c of the input feature map of the adaptive average pooling layer; pool_H and pool_W are calculated by the following formulas:
pool_H = floor(H / output_H)
pool_W = floor(W / output_W)
wherein floor(x) represents rounding x down; H is the length of the input feature map and W is the width of the input feature map; output_H is the length of the output feature map and is 9; output_W is the width of the output feature map and is 9; the length of a feature map is the maximum value of its row numbers, and the width of a feature map is the maximum value of its column numbers;
for the first convolution layer conv3_1, the window size of the convolution kernel is 1*1 and the sliding step length is 1, and it is used to output 128 feature maps;
for the second convolution layer conv3_2, the window size of the convolution kernel is 3*3 and the sliding step length is 1, and it is used to output 128 feature maps;
for the third convolution layer conv3_3, the window size of the convolution kernel is 5*5 and the sliding step length is 1, and it is used to output 128 feature maps;
for the fourth convolution layer conv3_4, the window size of the convolution kernel is 3*3 and the sliding step length is 1, and it is used to output 128 feature maps;
for the fifth convolution layer conv3_5, the window size of the convolution kernel is 5*5 and the sliding step length is 1, and it is used to output 128 feature maps;
for the sixth convolution layer conv3_6, the window size of the convolution kernel is 1*1 and the sliding step length is 1, and it is used to output 128 feature maps;
For the first batch normalization layer bn3-1, the second batch normalization layer bn3-2 and the third batch normalization layer bn3-3, the batch normalization formula is the same as the batch normalization formula of the batch normalization layer bn4 of the operation layer, the initial value of the scaling parameter gamma is 1, and the initial value of the offset parameter beta is 0;
for the first linear rectifying layer relu3-1, the second linear rectifying layer relu3-2 and the third linear rectifying layer relu3-3, the linear rectifying processing formula is the same as the linear rectifying processing formula of the linear rectifying layer relu4 of the operation layer;
for the up-sampling layer interpolate3, the processing is performed by the PyTorch library function torch.nn.functional.interpolate(input, size, mode), wherein input is the feature map output by the previous layer; size is the specified output size, namely 128 pixels by 128 pixels; and mode is the interpolation algorithm used, namely the bilinear interpolation algorithm.
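A corresponding sketch of the second convolutional neural network, with the same caveats (sequential reading, assumed paddings), together with the channel-dimension splice of step S223, is given below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalBranch(nn.Module):
    """Sketch of the second convolutional neural network (local feature map)."""
    def __init__(self, in_channels=512, pooled_size=9, out_size=128):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(pooled_size)      # adp_avg_pooling3, 9x9 output
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, 128, 1, stride=1),      # conv3_1
            nn.BatchNorm2d(128), nn.ReLU(),                # bn3-1, relu3-1
            nn.Conv2d(128, 128, 3, stride=1, padding=1),   # conv3_2
            nn.Conv2d(128, 128, 5, stride=1, padding=2),   # conv3_3
            nn.Conv2d(128, 128, 3, stride=1, padding=1),   # conv3_4
            nn.Conv2d(128, 128, 5, stride=1, padding=2),   # conv3_5
            nn.BatchNorm2d(128), nn.ReLU(),                # bn3-2, relu3-2
            nn.Conv2d(128, 128, 1, stride=1),              # conv3_6
            nn.BatchNorm2d(128), nn.ReLU(),                # bn3-3, relu3-3
        )
        self.out_size = out_size

    def forward(self, f1):
        x = self.body(self.pool(f1))
        # interpolate3: bilinear up-sampling back to 128x128
        return F.interpolate(x, size=(self.out_size, self.out_size), mode="bilinear")

# S223: splicing the global and local feature maps along the channel dimension
# f2 = torch.cat([global_branch(f1), local_branch(f1)], dim=1)
```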
Preferably, in step S200, a vehicle counting model is trained according to a predicted density map and a real density map corresponding to each remote sensing image to be trained, so as to obtain a trained vehicle counting model, which includes:
s200a, determining a predicted density map of a current remote sensing image to be trained according to the current remote sensing image to be trained in a remote sensing image training set;
S200b, training a vehicle counting model by using a predicted density map and a real density map corresponding to the current remote sensing image to be trained, and determining a loss function of the vehicle counting model according to an output result of the vehicle counting model;
s200c, updating the weight parameters of the convolution kernels in all convolution layers and the values of the scaling parameters and the offset parameters in all batch normalization layers according to the loss function of the vehicle counting model;
S200d, judging whether all remote sensing images to be trained in the remote sensing image training set have been used for training; if yes, turning to step S200f, and if not, continuing with step S200e;
s200e, taking the next remote sensing image to be trained in the remote sensing image training set as the current remote sensing image to be trained, and turning to the step S200a;
S200f, judging whether the current training round of the vehicle counting model reaches the preset training round; if so, ending, and if not, turning to step S200a,
wherein the calculation formula of the loss function of the vehicle counting model is:
L = L_Bayes + L_count
wherein L_Bayes is the Bayesian loss function and L_count is the counting error function;
the Bayesian loss function L_Bayes is defined as follows:
L_Bayes = Σ_{n=1..N'} F(1 − E[C_n])
wherein L_Bayes is the Bayesian loss function obtained by the vehicle counting model for the current remote sensing image to be trained; N' is the total number of pixel points in the predicted density map corresponding to the current remote sensing image to be trained, the predicted density map corresponding to the current remote sensing image to be trained being the output result obtained by inputting the current remote sensing image to be trained into the vehicle counting model; C_n is the position of the n-th pixel point in the predicted density map corresponding to the current remote sensing image to be trained; E[C_n] is the density of vehicles at the area corresponding to the n-th pixel point in the predicted density map corresponding to the current remote sensing image to be trained; F(·) is a loss function;
the counting error function L_count is defined as follows:
L_count = |Ŷ − Y|
wherein L_count is the counting error function obtained by the vehicle counting model for the current remote sensing image to be trained; Ŷ is the total number of vehicles predicted according to the predicted density map corresponding to the current remote sensing image to be trained, the predicted density map corresponding to the current remote sensing image to be trained being the output result obtained by inputting the current remote sensing image to be trained into the vehicle counting model; and Y is the total number of vehicles determined according to the real density map corresponding to the current remote sensing image to be trained.
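The training loop of steps S200a to S200f and the combined loss can be sketched as follows; the `bayesian_loss` callable is only a placeholder for the Bayesian loss defined above, and the absolute-difference form of the counting error follows the reconstruction used in this description, so both should be read as illustrative assumptions.

```python
import torch

def counting_loss(pred_density, gt_density, bayesian_loss):
    """Sketch of the combined loss: Bayesian term plus counting-error term."""
    l_bayes = bayesian_loss(pred_density, gt_density)  # per-image Bayesian loss (placeholder)
    y_hat = pred_density.sum()                         # predicted vehicle total
    y = gt_density.sum()                               # ground-truth vehicle total
    l_count = torch.abs(y_hat - y)                     # counting error (assumed absolute form)
    return l_bayes + l_count

def train(model, dataset, optimizer, bayesian_loss, epochs):
    """Sketch of S200a-S200f: iterate over the training set for a preset number of rounds."""
    for _ in range(epochs):                            # S200f: preset training rounds
        for image, gt_density in dataset:              # S200d/S200e: every training image (batched tensors)
            pred_density = model(image)                # S200a: predicted density map
            loss = counting_loss(pred_density, gt_density, bayesian_loss)  # S200b
            optimizer.zero_grad()
            loss.backward()                            # S200c: update conv weights, gamma and beta
            optimizer.step()
```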
In another aspect, the present invention provides an electronic device, including:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the above-described vehicle counting method based on residual information enhancement.
In yet another aspect, the present invention provides a computer readable storage medium having a computer program stored therein, the computer program being executed by a processor to implement the above-described vehicle counting method based on residual information enhancement.
The technical scheme provided by the embodiment of the disclosure has the beneficial effects that:
(1) According to the vehicle counting method based on residual information enhancement, the first feature map is determined through the feature extraction network VGG19 module; the first feature map F1 is a compressed information representation of the remote sensing image from which features unnecessary for counting have been removed. A global feature map and a local feature map are then determined from the first feature map F1 through the first convolutional neural network and the second convolutional neural network; the global feature map reflects texture information of the vehicles, while the local feature map reflects detailed feature information, namely features such as the angles and contours of the vehicles. The second feature map F2 is obtained by splicing the global feature map and the local feature map, so that it fuses the angle and contour information and the texture information of the vehicles. Next, the pooling query tensor Q, the pooling key tensor K and the pooling value tensor V of the spatial attention module are determined by combining the operation layers of the spatial attention module with the first feature map, and information enhancement processing is performed on the first feature map F1 based on the self-attention mechanism of the spatial attention module to obtain the third feature map F3, in which the vehicle information is strengthened and the background information is weakened. The third feature map F3 and the second feature map F2 are then fused to obtain a fused feature map, which contains three dimensions of vehicle information: the enhanced vehicle spatial information derived from the first feature map F1, and the vehicle local information and vehicle global information extracted into the second feature map F2; fusing these is more robust and accurate than relying on information of fewer dimensions. The predicted density map of the corresponding remote sensing image is determined from the fused feature map. Finally, the vehicle counting model is trained according to the predicted density map and the real density map corresponding to each remote sensing image to obtain the trained vehicle counting model, and vehicle counting based on any remote sensing image is realized through the trained vehicle counting model.
(2) In the vehicle counting method, each operation layer comprises a convolution layer conv4, a linear rectification layer relu4, a batch normalization layer bn4 and an addition layer add4. The convolution layer conv4 convolves the first feature map F1 to obtain a feature M. The linear rectification layer relu4 introduces a nonlinear transformation, enhancing the expressive capacity and nonlinear fitting capacity of the neural network, and linearly rectifies the feature M to obtain a feature P. The batch normalization layer bn4 limits the linearly rectified feature M (i.e. the feature P) to the interval [-1,1]; it can further extract the vehicle residual information lost during convolution to obtain a feature N, each element of which is kept within [-1,1]. The addition layer add4 adds the feature M and the feature N, so that the original result, the feature M, is fused with the residual information discarded during convolution: the result of the feature M is retained, and the information N in the interval [-1,1] is additionally learned. After Q, K and V are determined from the sum of the feature M and the feature N, the value at each position of the third feature map F3 output by the spatial attention module moves within a smaller interval, i.e. the vehicle information in the third feature map F3 is strengthened and the background information is weakened, thereby realizing accurate extraction of the vehicle features from the remote sensing picture and improving the accuracy of vehicle identification in remote sensing scenes.
(3) In the vehicle counting method, in the process of fusing the second feature map F2 with the third feature map F3 output by the spatial attention module, the residual information discarded during feature processing is taken into account. The third feature map F3 is processed by the first convolution layer conv5_1, the first linear rectification layer relu5-1, the first batch normalization layer bn5-1, the second convolution layer conv5_2, the second linear rectification layer relu5-2, the second batch normalization layer bn5-2 and the third convolution layer conv5_3, so that each element of the resulting fourth feature map F4 is kept within [-1,1]. F2 and F4 are then fused through F4*F2+F2 so as to combine the residual information lost during learning: the result of the second feature map F2 is retained, and F2 weighted by F4, whose elements lie in [-1,1], is additionally added. Comparative analysis shows that when only F2 is used as the fused feature map to output the density map, the errors between the components of the predicted density map and the components of the real density map lie within [-1,1]; by additionally imposing the F2*F4 structure on the fused feature map, under the constraint of the loss function the finally output density map is more accurate than a density map output directly from the feature map F2. In this way, the accuracy of vehicle counting in remote sensing scenes can be improved.
(4) In the vehicle counting method, the pixel size occupied by vehicles in the data set is taken into account. In the first convolutional neural network, the first feature map F1 is processed by the sequentially connected first convolution layer conv2_1, first batch normalization layer bn2-1, first linear rectification layer relu2-1, second convolution layer conv2_2, third convolution layer conv2_3, fourth convolution layer conv2_4, fifth convolution layer conv2_5, second batch normalization layer bn2-2, second linear rectification layer relu2-2, sixth convolution layer conv2_6, third batch normalization layer bn2-3 and third linear rectification layer relu2-3. By observing the data set it is found that vehicle sizes fall within the pixel ranges of 1*1, 3*3 and 5*5, so the convolution kernels are selected within this range and large convolution kernels are excluded, avoiding the input of redundant information.
(5) Likewise, for the pixel size occupied by vehicles in the data set, in the second convolutional neural network the first feature map is processed by a network composed of the sequentially connected adaptive average pooling layer adp_avg_pooling3, first convolution layer conv3_1, first batch normalization layer bn3-1, first linear rectification layer relu3-1, second convolution layer conv3_2, third convolution layer conv3_3, fourth convolution layer conv3_4, fifth convolution layer conv3_5, second batch normalization layer bn3-2, second linear rectification layer relu3-2, sixth convolution layer conv3_6, third batch normalization layer bn3-3, third linear rectification layer relu3-3 and up-sampling layer interpolate3, and the sizes of the selected convolution kernels are likewise within the pixel ranges of 1*1, 3*3 and 5*5.
(6) Tests on three public data sets show that the method achieves higher accuracy of counting results on the same data sets. Moreover, compared with the first two data sets, the third data set contains more scenes, which shows that the model is highly robust under different scenes and can meet the requirement of object counting in remote sensing images under different environments. It can alleviate the problems of poor robustness and low counting accuracy caused by uneven object distribution, large scale variation and complex background distribution in traditional remote sensing images.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required for the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings may be obtained according to these drawings without inventive effort for a person of ordinary skill in the art.
FIG. 1 is a flow chart of a method of vehicle counting based on residual enhancement information according to an embodiment of the present invention;
FIG. 2 is a detailed flowchart of a step of determining a predicted density map corresponding to a remote sensing image to be trained in the vehicle counting method based on residual enhancement information shown in FIG. 1;
FIG. 3 is a schematic diagram of a first convolutional neural network involved in the step of determining a predicted density map corresponding to the remote sensing image to be trained shown in FIG. 2;
FIG. 4 is a schematic diagram of a second convolutional neural network involved in the step of determining a predicted density map corresponding to the remote sensing image to be trained shown in FIG. 2;
FIG. 5 is a schematic diagram of an operation layer and a residual attention mechanism of the spatial attention module involved in the step of determining a predicted density map corresponding to the remote sensing image to be trained shown in FIG. 2;
fig. 6 is a schematic diagram of residual information fusion involved in the step of determining a predicted density map corresponding to the remote sensing image to be trained shown in fig. 2.
Detailed Description
To make the objectives, technical solutions and advantages of the present disclosure clearer, the embodiments of the present disclosure are described in further detail below with reference to the accompanying drawings.
Technical terms related to the embodiment of the present application are explained below:
feature extraction: feature extraction is a processing operation on an image that reduces the data dimension of some original input images or reassembles the original features for subsequent use.
Feature extraction network VGG19 module: VGG19 is a deep convolutional neural network structure proposed by the Visual Geometry Group at the University of Oxford in 2014. It is one of the VGG series and is also one of the most widely used deep learning models at present. The structure of VGG19 is very simple: it consists of 19 weight layers, namely 16 convolutional layers and 3 fully-connected layers. The convolutional layers use 3x3 convolution kernels, and the first two fully-connected layers use 4096 neurons each.
Feature map: the feature map refers to the input or output of each convolutional layer in the convolutional neural network. The feature map generated by the convolution layer at a shallow position in the convolutional neural network may be low-level features such as edges, corner points and the like of the input image, and the feature map generated by the convolution layer at a deep position may be high-level features which are difficult to intuitively interpret. The output feature map of most convolution layers often has tens or hundreds of image channels (most color pictures shot by mobile phones contain three channels of red, green and blue), and the feature map can be represented by a symbol X, and can be represented as: x ε R (H W C), where R is the real set, H is the length of the feature map, W is the width of the feature map, and C is the number of channels of the feature map. X can also be expressed as: the values of the channel numbers are represented by the maximum values of the channel numbers, the maximum values of the channel numbers are represented by the maximum values of the channel numbers, and the channel numbers are represented by the maximum values of the channel numbers, respectively, [ [1z1_1,1z1_2, ], [ 1z1_2, ], and the number, 1z2_c ], [1z2_1,1z2_2, ], and the number 2z2_c ], [1z2_1,1z2_2, ], and the number 1z2_1, and the number 1z2_c ], [1zw_1,1zw_2, ], and the number 2z2_2, and the number 2zw_2, wherein the channel numbers are represented by the maximum values of the channel numbers, and the channel numbers are represented by the maximum values of the channel numbers.
Global feature map: the global feature map is used to represent global features of the image to be processed, such as color features, texture features and shape features.
Local feature map: the local feature map is used to represent local features of the image to be processed, such as features extracted from edges, corners, points, lines, curves, and regions of particular properties of the object, and so on.
Convolutional Neural Network (CNN): convolutional neural networks are a large class of methods in deep learning, consisting of several, tens, or even hundreds of convolutional layers stacked. Convolutional neural networks are particularly suitable for processing data in image format. By utilizing the convolutional neural network, the computer algorithm can have the capability of partial human vision, such as image classification and target detection, and can also realize image enhancement functions of image denoising, image amplification and the like.
Adaptive average pooling layer: adaptive averaging pooling is a technique for image processing that can extract important features in an image and transform them into a low-dimensional space, thereby improving various operations in image processing. It is accomplished primarily by an algorithm called "adaptive averaging pooling" which can effectively extract important features in the image and convert them into more easily processed values. The adaptive averaging pooling algorithm is based on multiple resampling of each pixel of the image, which first uses a set of convolution kernels to extract features of the image, then computes a set of values from the extracted features by an adaptive set of parameters, and finally sums the values into an output.
Spatial attention module: the spatial attention module is an important model in deep learning, can accurately analyze the spatial data of the image, and can help a machine to understand the content and the spatial structure of the visual image. In general, the spatial attention module has the characteristic that 1, the spatial attention model has local integration characteristic and can focus on a certain block in an image instead of the whole image. 2. Multiple level attention the spatial attention model may implement multiple levels of attention to better capture complex structures and patterns in the image. 3. And the depth is self-adaptive, namely, a layered structure is automatically formed along with the training of the spatial attention model, so that the representation of the image is more effective.
Referring to fig. 1, the vehicle counting method based on residual enhancement information of the present embodiment includes the steps of:
step S100, a plurality of remote sensing images to be trained are obtained, and each remote sensing image to be trained is adjusted to be uniform in size, so that a remote sensing image training set is obtained.
Each remote sensing image to be trained is a remote sensing image obtained by shooting different areas at different angles by the unmanned aerial vehicle, and the remote sensing images are subsequently used for training a vehicle counting model.
In this embodiment, each remote sensing image to be trained is adjusted to a size of 1024 pixels by 1024 pixels. By processing the remote sensing images to be trained to a size of 1024 pixels by 1024 pixels, the impact of different image sizes on network performance and structure can be reduced. It will be appreciated that each remote sensing image to be trained may also be adjusted to 512 pixels by 512 pixels or 2048 pixels by 2048 pixels. In this embodiment, all the remote sensing images to be trained can be adjusted to the specified size using the resize() function in OpenCV.
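By way of illustration, such resizing could be performed with OpenCV as sketched below (the exact call used in this embodiment is not prescribed):

```python
import cv2

def load_and_resize(path, size=(1024, 1024)):
    """Resize a remote sensing image to the uniform training size."""
    image = cv2.imread(path)        # BGR image as a NumPy array
    return cv2.resize(image, size)  # default bilinear interpolation
```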
Step S200, determining a predicted density map corresponding to each remote sensing image to be trained in a remote sensing image training set, and training a vehicle counting model according to the predicted density map and the real density map corresponding to each remote sensing image to be trained to obtain a trained vehicle counting model;
the real density map corresponding to each remote sensing image to be trained can be obtained through the disclosed remote sensing image data set. In detail, the remote sensing image dataset comprises the position information of each vehicle on each remote sensing image to be trained, the position information of each vehicle is a real result of manual marking, and the real density map of the remote sensing image to be trained is obtained by processing the manually marked position information through Gaussian filtering.
It should be noted that, the real density map or the predicted density map of the remote sensing image to be trained includes a plurality of pixel points, and each pixel point corresponds to a certain geographic area of the remote sensing image to be trained. For the real density map, the value of a certain pixel point is the real value of the density of the vehicles at the region of the remote sensing image to be trained, which corresponds to the pixel point, and the sum of the values of all the pixel points is the real value of the total number of vehicles in the total region covered by the remote sensing image to be trained. For the predicted density map, the value of a certain pixel point is a predicted value of the density of vehicles in a corresponding region of the pixel point in the remote sensing image to be trained, and the sum of the values of all the pixel points is a predicted value of the total number of vehicles in the total region covered by the remote sensing image to be trained.
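One common way to realize the Gaussian-filtering step described above, shown only as an assumed illustration (the kernel width sigma and the use of SciPy are not specified by the embodiment), is:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def make_density_map(points, height, width, sigma=4.0):
    """points: list of (row, col) vehicle annotations; returns a map whose sum equals len(points)."""
    impulses = np.zeros((height, width), dtype=np.float32)
    for r, c in points:
        impulses[int(r), int(c)] += 1.0
    return gaussian_filter(impulses, sigma=sigma)  # Gaussian smoothing preserves the total count
```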
In this embodiment, after the training of the vehicle counting model is completed, the target remote sensing image of a certain target area is input to the trained vehicle counting model, the output of the trained vehicle counting model is the predicted density map corresponding to the target remote sensing image, and at this time, the total number of vehicles in the target area can be determined according to the sum of the values of all the pixels of the predicted density map of the target remote sensing image.
And S300, after receiving the vehicle counting request of the target area, acquiring a target remote sensing image corresponding to the vehicle counting request of the target area, and determining the number of vehicles of the target area based on the trained vehicle counting model.
The target area vehicle counting request is a request for counting the vehicles in a certain target area. After receiving a target area vehicle counting request, the target remote sensing image corresponding to the request is first acquired, the target remote sensing image is then adjusted to a size of 1024 pixels by 1024 pixels, and finally the resized target remote sensing image is input into the trained vehicle counting model; from the output result of the trained vehicle counting model, the number of vehicles in the target area is obtained.
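Inference then reduces to a single forward pass and a summation over the predicted density map, for example (a sketch assuming the model and preprocessing of the training stage):

```python
import torch

@torch.no_grad()
def count_vehicles(model, image_tensor):
    """image_tensor: preprocessed 1024x1024 target remote sensing image, shape (1, 3, 1024, 1024)."""
    density = model(image_tensor)  # predicted density map
    return float(density.sum())    # vehicle count = sum over all pixel values
```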
Specifically, referring to fig. 2, in step S200, determining a predicted density map corresponding to each remote sensing image to be trained in the remote sensing image training set specifically includes:
step S210, for each remote sensing image to be trained, feature extraction is performed on the remote sensing image to be trained by using a feature extraction network VGG19 module, and a first feature map F1 is obtained.
The input of the feature extraction network VGG19 module is a remote sensing image to be trained, the size of the remote sensing image to be trained is 1024 pixels, and the channel number is 3. The output of the feature extraction network VGG19 module is a first feature map F1, where the first feature map F1 is an image with a size of 128 pixels by 128 pixels and a channel number of 512.
The first feature map F1 output by the feature extraction network VGG19 module is an information representation of the remote sensing image to be trained after compression, and unnecessary features for vehicle counting in the remote sensing image to be trained are removed, so that subsequent information processing is facilitated.
Specifically, the feature extraction network VGG19 module in this step includes a first convolution layer conv1_1, a second convolution layer conv1_2, a first maximum pooling layer max_pooling1, a third convolution layer conv1_3, a fourth convolution layer conv1_4, a second maximum pooling layer max_pooling2, a fifth convolution layer conv1_5, a sixth convolution layer conv1_6, a seventh convolution layer conv1_7, an eighth convolution layer conv1_8, a third maximum pooling layer max_pooling3, a ninth convolution layer conv1_9, a tenth convolution layer conv1_10, an eleventh convolution layer conv1_11, a twelfth convolution layer conv1_12, a fourth maximum pooling layer max_pooling4, a thirteenth convolution layer conv1_13, a fourteenth convolution layer conv1_14, a fifteenth convolution layer conv1_15, a sixteenth convolution layer conv1_16, and a bilinear interpolation layer upsample_bilinear1.
It should be noted that, each convolution layer is used for extracting the characteristics of the vehicle in the remote sensing image to be trained, and each maximum pooling layer is a pooling layer that performs pooling processing in a maximum pooling manner, and is used for reducing the dimension of the image and reducing the data volume to be processed of the next layer.
The following describes each component of the feature extraction network VGG19 module in detail.
For the first convolution layer conv1_1, the window size of its convolution kernel is 3×3, and the sliding step size is 1. The input of the first convolution layer conv1_1 is a remote sensing image to be trained, the size of which is 1024 pixels x 1024 pixels, and the channel number of which is 3; the output of the first convolution layer conv1_1 is 64 feature maps. The output of the first convolution layer conv1_1 serves as the input of the second convolution layer conv1_2.
For the second convolution layer conv1_2, the window size of the convolution kernel is 3×3, and the sliding step size is 1, so as to output 64 feature maps, and the output of the second convolution layer conv1_2 is used as the input of the first max_pooling layer max_pooling1.
For the first maximum pooling layer max_pooling1, the pooling window size is 2×2, the sliding step size is 2, and the output of the first maximum pooling layer max_pooling1 is used as the input of the third convolution layer conv1_3. The output of the first max_pooling layer max_pooling1 is a feature map with a size of 512 pixels by 512 pixels and a channel number of 64.
The window size of the convolution kernel of the third convolution layer conv1_3 is 3*3, and the sliding step size is 1, so that 128 feature maps can be output, and the output of the third convolution layer conv1_3 is used as the input of the fourth convolution layer conv1_4.
For the fourth convolution layer conv1_4, the window size of the convolution kernel is 3*3, the sliding step size is 1, and the window size is used for outputting 128 feature maps, and the output of the fourth convolution layer conv1_4 is used as the input of the second maximum pooling layer max_pooling2.
For the second maximum pooling layer max_pooling2, the pooling window size is 2×2, the sliding step size is 2, and the output of the second maximum pooling layer max_pooling2 is used as the input of the fifth convolution layer conv1_5. The output of the second max_pooling layer max_pooling2 is a feature map with a size of 256 pixels by 256 pixels and a channel number of 128.
The window size of the convolution kernel is 3*3 and the sliding step size is 1 for the fifth convolution layer conv1_5, which is used to output 256 feature maps, and the output of the fifth convolution layer conv1_5 is used as the input of the sixth convolution layer conv1_6.
The window size of the convolution kernel is 3*3 and the sliding step size is 1 for the sixth convolution layer conv1_6, which is used to output 256 feature maps, and the output of the sixth convolution layer conv1_6 is used as the input of the seventh convolution layer conv1_7.
The window size of the convolution kernel is 3*3 and the sliding step size is 1 for the seventh convolution layer conv1_7, which is used to output 256 feature maps, and the output of the seventh convolution layer conv1_7 is used as the input of the eighth convolution layer conv1_8.
For the eighth convolution layer conv1_8, the window size of the convolution kernel is 3*3 and the sliding step size is 1, and it is used to output 256 feature maps; the output of the eighth convolution layer conv1_8 is used as the input of the third maximum pooling layer max_pooling3.
For the third maximum pooling layer max_pooling3, the pooling window size is 2×2, the sliding step size is 2, and the output of the third maximum pooling layer max_pooling3 is used as the input of the ninth convolution layer conv1_9. The output of the third max_pooling layer max_pooling3 is a feature map with a size of 128 pixels by 128 pixels and a channel number of 256.
The window size of the convolution kernel is 3*3 and the sliding step size is 1 for the ninth convolution layer conv1_9, which is used to output 512 feature maps, and the output of the ninth convolution layer conv1_9 is used as the input of the tenth convolution layer conv1_10.
The tenth convolution layer conv1_10 has a window size of 3*3 and a sliding step of 1, and outputs 512 feature maps, and the output of the tenth convolution layer conv1_10 is used as the input of the eleventh convolution layer conv1_11.
The eleventh convolution layer conv1_11 has a window size of 3*3 and a sliding step of 1, and outputs 512 feature maps, and the output of the eleventh convolution layer conv1_11 is used as the input of the twelfth convolution layer conv1_12.
The twelfth convolution layer conv1_12 has a window size of 3*3 and a sliding step of 1, and is used for outputting 512 feature maps, and the output of the twelfth convolution layer conv1_12 is used as the input of the fourth max_pooling layer max_pooling4.
For the fourth maximum pooling layer max_pooling4, the pooling window size is 2×2, the sliding step size is 2, and the output of the fourth maximum pooling layer max_pooling4 is used as the input of the thirteenth convolution layer conv1_13. The output of the fourth max_pooling layer max_pooling4 is a feature map with a size of 64 pixels by 64 pixels and a channel number of 512.
The thirteenth convolution layer conv1_13 has a window size of 3*3 and a sliding step of 1, and is configured to output 512 feature maps, and the output of the thirteenth convolution layer conv1_13 is used as the input of the fourteenth convolution layer conv1_14.
The window size of the convolution kernel is 3*3 and the sliding step size is 1 for the fourteenth convolution layer conv1_14, which is used to output 512 feature maps, and the output of the fourteenth convolution layer conv1_14 is used as the input of the fifteenth convolution layer conv1_15.
The fifteenth convolution layer conv1_15 has a window size of 3*3 and a sliding step of 1, and is configured to output 512 feature maps, and the output of the fifteenth convolution layer conv1_15 is used as an input to the sixteenth convolution layer conv1_16.
For the sixteenth convolution layer conv1_16, the window size of the convolution kernel is 3*3 and the sliding step size is 1, and it is used for outputting 512 feature maps; the output of the sixteenth convolution layer conv1_16 is used as the input of the bilinear interpolation layer upsample_bilinear1.
For the bilinear interpolation layer upsample_bilinear1, the scaling rate is 2; the input of the bilinear interpolation layer upsample_bilinear1 is a feature map with a size of 64 pixels by 64 pixels and a channel number of 512, and the output of the bilinear interpolation layer upsample_bilinear1 is the first feature map F1. The first feature map F1 has a size of 128 pixels by 128 pixels and a channel number of 512.
In this embodiment, for the first max_pooling layer max_pooling1, the second max_pooling layer max_pooling2, the third max_pooling layer max_pooling3, and the fourth max pooling layer max_pooling4, the pooling formula of each max pooling layer is:

output(c, h_s, w_s) = max over k_h ∈ [1, k], k_w ∈ [1, k] of input(c, s·h_s + k_h, s·w_s + k_w)

wherein c is the channel number of the input feature map of the max pooling layer in the channel dimension; h is the row number of the input feature map in the length dimension; w is the column number of the input feature map in the width dimension; k represents the size of the pooling window of the max pooling layer, and in this embodiment the pooling window size is 2×2, i.e. k is 2; s is the step size of the max pooling layer, and s is 2; k_h and k_w are variables, k_h ranging over the length of the pooling window and k_w over the width of the pooling window, with k_h ∈ [1, k] and k_w ∈ [1, k]; h_s is the row number corresponding to h in the input feature map of the max pooling layer when the step size is s; w_s is the column number corresponding to w in the input feature map when the step size is s;
input(c, s·h_s + k_h, s·w_s + k_w) is the value of the input feature map of the max pooling layer at channel number c, row number s·h_s + k_h and column number s·w_s + k_w; output(c, h_s, w_s) is the corresponding output of the max pooling layer at channel number c, row number h_s and column number w_s, obtained as the maximum of input(c, s·h_s + k_h, s·w_s + k_w) over all combinations of k_h and k_w satisfying k_h ∈ [1, k] and k_w ∈ [1, k].
For the bilinear interpolation layer upsample_bilinear1, the output formula is:

f(x, y) = [ f(Q11)·(x2 − x)·(y2 − y) + f(Q21)·(x − x1)·(y2 − y) + f(Q12)·(x2 − x)·(y − y1) + f(Q22)·(x − x1)·(y − y1) ] / [ (x2 − x1)·(y2 − y1) ]

In this embodiment, Q11, Q12, Q21 and Q22 are four adjacent points in the input feature map of the bilinear interpolation layer: Q11(x1, y1) is at the upper left, Q12(x1, y2) at the lower left, Q21(x2, y1) at the upper right, and Q22(x2, y2) at the lower right. f(Q11) is the pixel value of point Q11, f(Q12) is the pixel value of point Q12, f(Q21) is the pixel value of point Q21, and f(Q22) is the pixel value of point Q22. f(x, y) is the pixel value corresponding to the point lying among these 4 points, and (x, y) is the position coordinate of that point.
In this step, the feature extraction network VGG19 module uses 16 convolution layers, each with a 3*3 convolution kernel. Increasing the number of small convolution kernels and the depth of the network improves the accuracy of vehicle feature extraction from the remote sensing image to be trained, and stacking small convolutions achieves the same receptive field as a large convolution kernel. In addition, none of the convolution layers in the feature extraction network VGG19 module performs a dimension reduction operation, because dimension reduction would cause loss of vehicle information; keeping the dimensions preserves more vehicle information.
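To relate this description to code, the following is a minimal PyTorch sketch of such a VGG19-style backbone. The block grouping, the ReLU activations between backbone convolutions, the padding values and the 1024×1024 input size are assumptions inferred from the feature-map sizes given above, not details stated in the text.

```python
# Minimal sketch of the first-feature-map extraction (step S210), under the
# assumptions noted above; layer names in comments follow the text.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(in_ch, out_ch, n_convs):
    """n_convs 3x3 convolutions (stride 1, padding 1) followed by 2x2 max pooling."""
    layers = []
    for i in range(n_convs):
        layers.append(nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, stride=1, padding=1))
        layers.append(nn.ReLU(inplace=True))   # activation between backbone convs: assumption
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

class Vgg19Backbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = conv_block(3, 64, 2)     # conv1_1..conv1_2  + max_pooling1
        self.block2 = conv_block(64, 128, 2)   # conv1_3..conv1_4  + max_pooling2
        self.block3 = conv_block(128, 256, 4)  # conv1_5..conv1_8  + max_pooling3
        self.block4 = conv_block(256, 512, 4)  # conv1_9..conv1_12 + max_pooling4
        # conv1_13..conv1_16: no pooling afterwards, followed by x2 bilinear upsampling
        self.block5 = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(512, 512, 3, stride=1, padding=1), nn.ReLU(inplace=True))
            for _ in range(4)
        ])

    def forward(self, x):
        x = self.block4(self.block3(self.block2(self.block1(x))))
        x = self.block5(x)
        # upsample_bilinear1: scaling rate 2
        return F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)

# Assuming a 1024x1024 RGB input, the output has shape (1, 512, 128, 128),
# matching the first feature map F1 described above.
f1 = Vgg19Backbone()(torch.randn(1, 3, 1024, 1024))
```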
Step S220, for each remote sensing image to be trained, extracting a global feature map and a local feature map of the remote sensing image to be trained, and splicing the global feature map and the local feature map to obtain a second feature map F2.
Specifically, step S220 includes:
s221, performing feature extraction on the first feature map F1 by using the first convolutional neural network to obtain a global feature map.
In this example, the global feature map output by the first convolutional neural network can reflect texture information of the vehicle.
Referring to fig. 3, the first convolutional neural network includes a first convolution layer conv2_1, a first batch normalization layer bn2-1, a first linear rectifying layer relu2-1, a second convolution layer conv2_2, a third convolution layer conv2_3, a fourth convolution layer conv2_4, a fifth convolution layer conv2_5, a second batch normalization layer bn2-2, a second linear rectifying layer relu2-2, a sixth convolution layer conv2_6, a third batch normalization layer bn2-3, and a third linear rectifying layer relu2-3, which are sequentially connected.
The batch normalization layer normalizes its input so that each feature has a mean close to 0 and a variance close to 1, which makes the input distribution of each layer more stable and helps the network converge quickly. By continuously adjusting the intermediate outputs in this way, the numerical stability of the whole network is ensured.
The linear rectifying layer (Rectified Linear Unit, ReLU), also called a rectified linear unit, is a commonly used activation function in artificial neural networks. ReLU is used as a nonlinear activation function that maps negative values to zero and keeps positive values unchanged, thereby introducing a nonlinear activation response. This allows the deep neural network to learn more complex feature representations and improves the expressive capacity of the model. At the same time, the linear rectifying layer keeps the gradient equal to 1 for positive inputs, which avoids the vanishing gradient problem, lets the network propagate gradients better, and promotes training and convergence of the model.
The components of the first convolutional neural network are described in detail below.
For the first convolution layer conv2_1, the window size of its convolution kernel is 1*1 and the sliding step size is 1. The input of the first convolution layer conv2_1 is a first feature map F1 with 128 pixels by 128 pixels and 512 channels; the output of the first convolution layer conv2_1 is 128 feature maps. The output of the first convolution layer conv2_1 serves as the input to the first normalization layer bn 2-1.
For the first batch normalization layer bn2-1, its output is taken as the input of the first linear rectifying layer relu2-1.
For the first linear rectifying layer relu2-1, its output is taken as the input of the second convolution layer conv2_2.
For the second convolution layer conv2_2, the window size of its convolution kernel is 3*3 and the sliding step size is 1, for outputting 128 feature maps, the output of which is input to the third convolution layer conv2_3.
For the third convolution layer conv2_3, the window size of its convolution kernel is 5*5 and the sliding step size is 1, for outputting 128 feature maps, the output of which is input to the fourth convolution layer conv2_4.
For the fourth convolution layer conv2_4, the window size of its convolution kernel is 3*3 and the sliding step size is 1, for outputting 128 feature maps, the output of which is input to the fifth convolution layer conv2_5.
For the fifth convolution layer conv2_5, the window size of its convolution kernel is 5*5 and the sliding step size is 1, so as to output 128 feature maps, and its output is used as the input of the second normalization layer bn 2-2.
For the second batch normalization layer bn2-2, its output is taken as the input of the second linear rectifying layer relu2-2.
For the second linear rectifying layer relu2-2, its output is taken as input to the sixth convolutional layer conv2_6.
For the sixth convolution layer conv2_6, the window size of its convolution kernel is 1*1 and the sliding step size is 1, for outputting 128 feature maps, the output of which is input to the third normalization layer bn 2-3.
For the third batch normalization layer bn2-3, its output is taken as the input of the third linear rectifying layer relu2-3.
For the third linear rectifying layer relu2-3, the output is the global feature map. The global feature map is a feature map with 128 pixels by 128 pixels and 128 channels.
In this embodiment, for the first, second, and third batch normalization layers bn2-1, bn2-2, and bn2-3, the batch normalization processing formula of each batch normalization layer is:

x'_i = (x_i − μ) / √(σ² + ε),    y_i = γ·x'_i + β

wherein N is the number of all elements in the input feature map of the batch normalization layer; x_i is the i-th element of the input feature map of the batch normalization layer, with 1 ≤ i ≤ N; μ denotes the mean value of x_i; σ denotes the standard deviation of x_i; ε is a positive constant close to 0; x'_i denotes the standardized value of x_i; γ and β are the scaling parameter and the offset parameter respectively, the initial value of the scaling parameter γ is 1 and the initial value of the offset parameter β is 0; y_i is the i-th element of the output feature map of the batch normalization layer.
In this embodiment, for the first linear rectifying layer relu2-1, the second linear rectifying layer relu2-2, and the third linear rectifying layer relu2-3, the linear rectification processing formula of each linear rectifying layer is:

ReLU(x) = max(0, x)

wherein x represents each element of the input feature map of the linear rectifying layer, and ReLU(x) represents the element of the output feature map of the linear rectifying layer corresponding to the input feature map element x.
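As a concrete illustration of the global-feature branch described in step S221, a minimal PyTorch sketch is given below. The padding values (1 for the 3*3 kernels, 2 for the 5*5 kernels) are assumptions needed to keep the 128×128 spatial size; they are not stated explicitly in the text.

```python
# Sketch of the first convolutional neural network (global feature map), step S221.
import torch.nn as nn

global_branch = nn.Sequential(
    nn.Conv2d(512, 128, kernel_size=1, stride=1),             # conv2_1
    nn.BatchNorm2d(128), nn.ReLU(inplace=True),                # bn2-1, relu2-1
    nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1),   # conv2_2
    nn.Conv2d(128, 128, kernel_size=5, stride=1, padding=2),   # conv2_3
    nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1),   # conv2_4
    nn.Conv2d(128, 128, kernel_size=5, stride=1, padding=2),   # conv2_5
    nn.BatchNorm2d(128), nn.ReLU(inplace=True),                # bn2-2, relu2-2
    nn.Conv2d(128, 128, kernel_size=1, stride=1),              # conv2_6
    nn.BatchNorm2d(128), nn.ReLU(inplace=True),                # bn2-3, relu2-3
)
# Input: F1 of shape (N, 512, 128, 128); output: global feature map of shape (N, 128, 128, 128).
```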
And S222, performing feature extraction on the first feature map F1 by using a second convolutional neural network to obtain a local feature map.
Referring to fig. 4, the second convolutional neural network includes an adaptive average pooling layer adp_avg_pooling3, a first convolution layer conv3_1, a first batch normalization layer bn3-1, a first linear rectifying layer relu3-1, a second convolution layer conv3_2, a third convolution layer conv3_3, a fourth convolution layer conv3_4, a fifth convolution layer conv3_5, a second batch normalization layer bn3-2, a second linear rectifying layer relu3-2, a sixth convolution layer conv3_6, a third batch normalization layer bn3-3, a third linear rectifying layer relu3-3, and an upsampling layer interpolate3, which are sequentially connected.
It should be noted that the adaptive averaging pooling layer (Adaptive Average Pooling Layer) is a pooling operation commonly used in deep learning, and is used to adjust the size of the input feature map to a fixed size, while retaining more feature information.
In this embodiment, the local feature map obtained through the pooling processing of the adaptive average pooling layer in the second convolutional neural network can reflect the detailed feature information of the vehicle, that is, features such as angle and contour.
The components of the second convolutional neural network are described in detail below.
For the adaptive average pooling layer adp_avg_pooling3, the output length is 9 and the output width is 9. The input of the adaptive average pooling layer is the first feature map F1 with a size of 128 pixels by 128 pixels and a channel number of 512; the output of the adaptive average pooling layer serves as the input of the first convolution layer conv3_1. The output of the adaptive average pooling layer adp_avg_pooling3 is a feature map with a size of 9 pixels by 9 pixels and a channel number of 512.
For the first convolution layer conv3_1, the window size of the convolution kernel is 1*1, the sliding step size is 1, and the window size is used for outputting 128 feature maps, and the output of the first convolution layer conv3_1 is used as the input of the first normalization layer bn 3-1.
For the first batch normalization layer bn3-1, the batch normalization formula is the same as that of the batch normalization layer of the first convolutional neural network, the initial value of the scaling parameter gamma is 1, and the initial value of the offset parameter beta is 0. The output of the first normalization layer bn3-1 serves as an input to the first linear rectification layer relu 3-1.
For the first linear rectifying layer relu3-1, the linear rectifying processing formula is the same as that of the linear rectifying layer of the first convolutional neural network. The output of the first linear integer layer relu3-1 serves as an input to the second convolutional layer conv3_2.
For the second convolution layer conv3_2, the window size of the convolution kernel is 3*3, the sliding step size is 1, and the output of the second convolution layer conv3_2 is used as the input of the third convolution layer conv3_3.
The window size of the convolution kernel of the third convolution layer conv3_3 is 5*5, and the sliding step size is 1, so that 128 feature maps can be output, and the output of the third convolution layer conv3_3 serves as the input of the fourth convolution layer conv3_4.
The window size of the convolution kernel is 3*3 and the sliding step size is 1 for the fourth convolution layer conv3_4, which is used to output 128 feature maps, and the output of the fourth convolution layer conv3_4 is used as the input of the fifth convolution layer conv3_5.
For the fifth convolution layer conv3_5, the window size of its convolution kernel is 5*5 and the sliding step size is 1, and it is used to output 128 feature maps; the output of the fifth convolution layer conv3_5 is used as the input of the second batch normalization layer bn3-2.
For the second batch normalization layer bn3-2, the batch normalization formula is the same as that of the batch normalization layer of the first convolutional neural network, the initial value of the scaling parameter gamma is 1, and the initial value of the offset parameter beta is 0. The output of the second normalization layer bn3-2 serves as an input to the second linear rectification layer relu 3-2.
For the second linear rectifying layer relu3-2, the linear rectification processing formula is the same as that of the linear rectifying layer of the first convolutional neural network. The output of the second linear rectifying layer relu3-2 serves as the input of the sixth convolution layer conv3_6.
For the sixth convolution layer conv3_6, the window size of its convolution kernel is 1*1 and the sliding step size is 1, for outputting 128 feature maps, the output of which is input to the third normalization layer bn 3-3.
For the third batch normalization layer bn3-3, the batch normalization formula is the same as that of the batch normalization layer of the first convolutional neural network, the initial value of the scaling parameter gamma is 1, and the initial value of the offset parameter beta is 0. The output of the third batch normalization layer bn3-3 is taken as the input of the third linear rectifying layer relu3-3.
For the third linear rectifying layer relu3-3, the linear rectification processing formula is the same as that of the linear rectifying layer of the first convolutional neural network. The output of the third linear rectifying layer relu3-3 serves as the input of the upsampling layer interpolate3.
For the upsampling layer interpolate3, the input is a feature map with a size of 9 pixels by 9 pixels and a channel number of 128, and the output is the local feature map, which is a feature map with a size of 128 pixels by 128 pixels and a channel number of 128. The upsampling layer interpolate3 performs bilinear interpolation in two dimensions; the specific process is completed by the pytorch library function torch.nn.functional.interpolate(input, size, mode), which does not change the channel number of the feature map and outputs a feature map whose length and width follow the specified output size. Here input is the feature map output by the previous layer, whose dimension in this example is 9×9×128; size is the specified output size, in this example 128×128; mode is the interpolation algorithm used, which in this embodiment is the bilinear interpolation algorithm.
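A minimal usage example of the interpolate call described above is shown below; the align_corners setting is an assumption, since the text does not specify it.

```python
# The 9x9, 128-channel output of relu3-3 is resized to 128x128 by bilinear
# interpolation; the channel count is unchanged.
import torch
import torch.nn.functional as F

x = torch.randn(1, 128, 9, 9)                                              # output of relu3-3
local_map = F.interpolate(x, size=(128, 128), mode='bilinear', align_corners=False)
print(local_map.shape)                                                      # torch.Size([1, 128, 128, 128])
```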
In this embodiment, for the first batch normalization layer bn3-1, the second batch normalization layer bn3-2, and the third batch normalization layer bn3-3 of the second convolutional neural network, the batch normalization formula of each batch normalization layer is the same as the batch normalization formula of the batch normalization layer of the first convolutional neural network, and the initial value of the scaling parameter γ is 1, and the initial value of the offset parameter β is 0.
In this embodiment, for the first linear rectifying layer relu3-1, the second linear rectifying layer relu3-2, and the third linear rectifying layer relu3-3 of the second convolutional neural network, the linear rectifying processing formula of each linear rectifying layer is the same as the linear rectifying processing formula of the linear rectifying layer of the first convolutional neural network.
In this embodiment, the pooling formula of the adaptive average pooling layer adp_avg_pooling3 is:
Output[i, j, c] = 1/(pool_H * pool_W) * Σ_m Σ_n Input[m, n, c], where the double sum runs over the pool_H rows and pool_W columns of the input pooling window corresponding to output position (i, j).
the size of the input feature map of the adaptive averaging pooling layer adp_avg_pooling3 is h×w×c, H represents the length of the input feature map, W represents the width of the input feature map, and C represents the number of channels of the input feature map. The length of the feature map is the maximum value of the row number of the feature map, and the width of the feature map is the maximum value of the column number of the feature map. The size of the output feature map of the adaptive averaging pooling layer adp_avg_pooling3 is output_h×output_w×c, where output_h represents the length of the output feature map, output_w represents the width of the output feature map, and the number of channels of the output feature map is the same as the number of channels of the input feature map. output (i, j, c) represents values when the line number is i, the column number is j, and the channel number is c in the output feature map of the adaptive average pooling layer adp_avg_pooling3, input (m, n, c) represents values when the line number is m, the column number is n, and the channel number is c in the input feature map of the adaptive average pooling layer adp_avg_pooling3, and pool_h and pool_w each represent the size of the adaptive average pooling window. The sizes pool_h and pool_w of the adaptive average pooling window are adaptively adjusted according to the sizes output_h and output_w of the target output, and are calculated by the following formula:
pool_H=floor(H/output_H)
pool_W=floor(W/output_W)
Wherein floor (x) represents rounding x down; h is the length of the input feature map, and W is the width of the input feature map; output_h is the length of the output feature map, in this embodiment, output_h is 9; output_w is the width of the output feature map, and in this embodiment, output_w is 9. The length of the feature map is the maximum value of the row number of the feature map, and the width of the feature map is the maximum value of the column number of the feature map.
Thus, the parameters of the adaptive average pooling layer include the size H×W×C of the input feature map and the size output_H×output_W×C of the target output. In this embodiment, the length output_H of the target output is 9 and the width output_W of the target output is 9, and the final output feature map can be obtained by calculating the sizes pool_H and pool_W of the pooling window.
In this embodiment, the size of the output of the adaptive average pooling layer adp_avg_pooling3 is 9×9×512. The adaptive average pooling layer works as follows: first, according to the required output size (in this model the length of the output feature map is set to 9 and the width to 9), the size of the pooling window is calculated. The input feature map is then partitioned into a number of windows, each equal in size to the pooling window. Finally, an average value is calculated for each window, and the obtained average value is used as the element value at the corresponding position of the output.
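The same operation can be illustrated with PyTorch's built-in AdaptiveAvgPool2d. This operator is an assumed stand-in for the pooling described above, and its window-boundary computation may differ slightly from the floor-based formula when the input size is not divisible by the output size.

```python
# The 128x128, 512-channel first feature map F1 is reduced to a 9x9, 512-channel map.
import torch
import torch.nn as nn

f1 = torch.randn(1, 512, 128, 128)
pooled = nn.AdaptiveAvgPool2d(output_size=(9, 9))(f1)
print(pooled.shape)   # torch.Size([1, 512, 9, 9])
```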
S223, calling the cat function to splice the global feature map and the local feature map in the channel dimension to obtain a second feature map F2. The second feature map F2 has a size of 128 pixels by 128 pixels and a channel number of 256.
In this example, the second feature map F2 fuses the angle, contour information, and texture information of the vehicle.
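A short sketch of the splicing in step S223, assuming both branch outputs have the shapes given above:

```python
# Channel-wise concatenation of the global and local feature maps, each (N, 128, 128, 128),
# giving the second feature map F2 of shape (N, 256, 128, 128).
import torch

global_map = torch.randn(1, 128, 128, 128)
local_map = torch.randn(1, 128, 128, 128)
f2 = torch.cat([global_map, local_map], dim=1)   # dim=1 is the channel dimension
print(f2.shape)                                  # torch.Size([1, 256, 128, 128])
```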
In step S230, the first feature map F1 is input to three separate operation layers in the spatial attention module, so as to determine a pooled query tensor Q, a pooled key tensor K and a pooled value tensor V according to the convolution feature map output by each operation layer.
Specifically, referring to fig. 5, each operation layer includes a convolution layer conv4, a linear rectification layer relu4, a batch normalization layer bn4, and an addition layer add4. The input of the convolution layer conv4 is the first feature map F1, one output end of the convolution layer conv4 is connected to the input end of the linear rectification layer relu4, the output end of the linear rectification layer relu4 is connected to the input end of the batch normalization layer bn4, and the output end of the batch normalization layer bn4 is connected to the input end of the addition layer add4 together with the other output end of the convolution layer conv 4.
The respective constituent parts of each operation layer are described in detail below.
For the convolution layer conv4, the window size of the convolution kernel is 3*3, the sliding step length is 1, and the input of the convolution layer conv4 is a first feature map F1 with the size of 128 pixels by 128 pixels and the channel number of 512; the output of the convolution layer conv4 is 512 feature graphs, and the output of the convolution layer conv4 serves as the input of the linear rectification layer relu4 and the input of the addition layer add4.
For the linear rectifying layer relu4, the linear rectifying processing formula is the same as that of the linear rectifying layer of the first convolutional neural network. The output of the linear rectifying layer relu4 serves as input to the batch normalization layer bn 4.
For the batch normalization layer bn4, the batch normalization processing formula is the same as that of the batch normalization layer of the first convolutional neural network, and the initial value of the scaling parameter gamma is 1, and the initial value of the offset parameter beta is 0. The output of the batch normalization layer bn4 serves as an input to the addition layer add 4.
For the addition layer add4, it adds the output of the batch normalization layer bn4 and the output of the convolution layer conv4, and outputs a corresponding feature map, where the size of the feature map is 128 pixels by 128 pixels, and the channel number is 512.
In this embodiment, the convolution feature graphs output by the add layer add4 in the three independent operation layers are respectively used as a pooling query tensor Q, a pooling key tensor K, and a pooling value tensor V.
In this embodiment, the convolution layer conv4 convolves the first feature map F1 to obtain the feature M. The linear rectifying layer relu4 introduces a nonlinear transformation, enhancing the expressive capacity and nonlinear fitting capacity of the neural network, and performs linear rectification on the feature M to obtain the feature P. The batch normalization layer bn4 limits the linearly rectified result of the feature M (that is, the feature P) to the interval [-1,1]; in this way the batch normalization layer bn4 can further extract the vehicle information lost during the convolutions of the convolution layers to obtain the feature N, and each element of the feature N is kept in [-1,1]. The addition layer add4 adds the feature M and the feature N, so that the original result, the feature M, fuses the information discarded during convolution: the result of the feature M is retained, and the information N in the [-1,1] interval is additionally learned. Therefore, after Q, K and V are determined from the sum of the feature M and the feature N, each element of the third feature map F3 output by the spatial attention module fluctuates within a smaller interval, that is, the vehicle information in the third feature map F3 is enhanced and the background information is weakened.
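The following is a minimal sketch of one such operation layer, assuming padding 1 so that the 3*3 convolution keeps the 128×128 spatial size; three independent instances applied to F1 yield Q, K and V.

```python
# One operation layer of step S230: conv4 -> relu4 -> bn4, then the batch-normalised
# result (feature N) is added back to the convolution output (feature M) by add4.
import torch
import torch.nn as nn

class OperationLayer(nn.Module):
    def __init__(self, channels=512):
        super().__init__()
        self.conv4 = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.relu4 = nn.ReLU(inplace=False)   # not in-place: the conv output is reused below
        self.bn4 = nn.BatchNorm2d(channels)

    def forward(self, f1):
        m = self.conv4(f1)              # feature M
        n = self.bn4(self.relu4(m))     # feature N, residual information in a bounded range
        return m + n                    # add4: fuse M with the residual information N

f1 = torch.randn(1, 512, 128, 128)
q_layer, k_layer, v_layer = OperationLayer(), OperationLayer(), OperationLayer()
Q, K, V = q_layer(f1), k_layer(f1), v_layer(f1)   # each of shape (1, 512, 128, 128)
```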
Step S240, according to the pooling query tensor Q, the pooling key tensor K and the pooling value tensor V, the information enhancement processing is performed on the first feature map F1 based on the self-attention mechanism of the spatial attention module, so as to obtain a third feature map F3.
The input of the spatial attention module is the first feature map F1, with a size of 128 pixels by 128 pixels and a channel number of 512. The output of the spatial attention module is the third feature map F3, which has a size of 128 pixels by 128 pixels and a channel number of 512.
The self-attention mechanism of the spatial attention module is defined as the following formula:

Attention(Q, K, V) = Softmax(Q·K^T / √(dk)) · V

wherein Attention is the attention distribution matrix, Q represents the pooling query tensor (Query), K represents the pooling key tensor (Key), V represents the pooling value tensor (Value), and Softmax represents the Softmax activation function; √(dk) is a scaling factor used to prevent the gradient from vanishing, and in this embodiment both d and k are set to 1; T denotes the matrix transpose operation.
The Softmax function is defined as follows:

Softmax(z_i, c) = exp(x(z_i, c)) / Σ_{j=1..k} exp(x(z_j, c))

The Softmax function, or normalized exponential function, is a generalization of the logistic function. The vector corresponding to channel number c in the input feature map of the Softmax function is z = [z_1, z_2, ..., z_k], wherein z_i denotes the i-th element and z_j the j-th element of that vector, which contains k elements in total; x(z_i, c) denotes the value of the i-th element, and x(z_j, c) the value of the j-th element, of the vector z corresponding to channel number c of the input feature map of the Softmax function. Softmax(z_i, c) denotes the value of the i-th element of the vector corresponding to channel number c of the output feature map of the Softmax function; the value of each element of the output feature map of the Softmax layer ranges from 0 to 1, and the sum of all elements is 1.
Different from convolution, this mechanism can perform spatial interaction and global correlation calculation on the feature map. The scheme applies Softmax to the product of the generated pooling query tensor Q and the transpose of the pooling key tensor K, and then multiplies the result element-wise with the pooling value tensor V, which weights the features at different positions. During training, under the constraint of the loss function, this realizes the information enhancement processing of the first feature map F1: the vehicle information is enhanced and the background information is weakened, yielding the third feature map F3.
The input of the spatial attention module is the first feature map F1, with a size of 128 pixels by 128 pixels and a channel number of 512. In this embodiment, Q is multiplied by the transpose of K; since d and k are both 1, the product of Q and the transpose of K is processed directly by the Softmax function and then multiplied by V to give the output of the spatial attention module (the third feature map F3), which has a size of 128 pixels by 128 pixels and a channel number of 512.
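A hedged sketch of this computation is given below. The text does not spell out how Q, K and V are reshaped for the matrix product, so the flattening of spatial positions used here is an assumption, and a small spatial size is used for readability.

```python
# Self-attention of step S240: Attention = Softmax(Q K^T / sqrt(dk)) V with d = k = 1.
import torch
import torch.nn.functional as F

def spatial_self_attention(Q, K, V, dk=1.0):
    n, c, h, w = Q.shape
    q = Q.flatten(2).transpose(1, 2)             # (n, h*w, c)
    k = K.flatten(2)                             # (n, c, h*w)
    v = V.flatten(2).transpose(1, 2)             # (n, h*w, c)
    attn = F.softmax(q @ k / dk ** 0.5, dim=-1)  # (n, h*w, h*w) attention distribution matrix
    out = attn @ v                               # weighted sum of the value tensor
    return out.transpose(1, 2).reshape(n, c, h, w)

Q = K = V = torch.randn(1, 512, 8, 8)            # small spatial size only for illustration
F3_small = spatial_self_attention(Q, K, V)       # same shape as the input, (1, 512, 8, 8)
```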
According to the embodiment, a self-attention mechanism with a global and dynamic receptive field is integrated into a convolutional neural network architecture, the self-attention mechanism of a spatial attention module is utilized to further enhance the first feature map F1, and the self-attention mechanism establishes dynamic weight parameters by carrying out relevant and irrelevant choices on information features of the first feature map F1 so as to strengthen weight information of a vehicle and weaken weight information of a background, thereby achieving accurate extraction of the features and improving accuracy of recognition of the vehicle in a remote sensing scene.
According to the method, the lost information in the convolution process is fused through the new spatial attention module, the feature is enhanced after feature extraction, different weights are given to different positions of input data, and the model pays more attention to important information, so that the weight information of the vehicle is higher.
S250, processing a third feature map F3 of each remote sensing image to be trained through a third convolutional neural network, multiplying the third feature map F3 with a second feature map F2 channel by channel to obtain a fifth feature map F5, and adding the fifth feature map F5 and the second feature map F2 channel by channel to fuse to obtain a fused feature map.
The step S250 specifically includes:
S251, the third characteristic diagram F3 is processed by using a third convolution neural network, and a fourth characteristic diagram F4 is obtained.
Referring to fig. 6, the third convolutional neural network includes a first convolutional layer conv5_1, a first linear rectifying layer relu5-1, a first batch of normalizing layers bn5-1, a second convolutional layer conv5_2, a second linear rectifying layer relu5-2, a second batch of normalizing layers bn5-2, and a third convolutional layer conv5_3, which are sequentially connected.
For the first convolution layer conv5_1, the window size of its convolution kernel is 3*3 and the sliding step size is 1. The input of the first convolution layer conv5_1 is the third feature map F3 with a size of 128 pixels by 128 pixels and a channel number of 512; the first convolution layer conv5_1 outputs 256 feature maps, and its output is taken as the input of the first linear rectifying layer relu5-1.
For the first linear rectifying layer relu5-1, the linear rectifying processing formula is the same as that of the linear rectifying layer of the first convolutional neural network. The output of the first linear rectifying layer relu5-1 serves as input to the first normalization layer bn 5-1.
For the first batch normalization layer bn5-1, the batch normalization formula is the same as that of the batch normalization layer of the first convolutional neural network, the initial value of the scaling parameter gamma is 1, and the initial value of the offset parameter beta is 0. The output of the first normalization layer bn5-1 serves as an input to the second convolution layer conv5_2.
For the second convolution layer conv5_2, the window size of its convolution kernel is 3*3 and the sliding step size is 1, and it is used to output 128 feature maps; the output of the second convolution layer conv5_2 is used as the input of the second linear rectifying layer relu5-2.
For the second linear rectifying layer relu5-2, the linear rectifying processing formula is the same as that of the linear rectifying layer of the first convolutional neural network. The output of the second linear rectifying layer relu5-2 serves as input to the second batch of normalizing layers bn 5-2.
For the second batch normalization layer bn5-2, the batch normalization formula is the same as that of the batch normalization layer of the first convolutional neural network, the initial value of the scaling parameter gamma is 1, and the initial value of the offset parameter beta is 0. The output of the second normalization layer bn5-2 serves as an input to the third convolution layer conv5_3.
For the third convolution layer conv5_3, the window size of its convolution kernel is 1*1 and the sliding step size is 1, and it is used to output 1 feature map; the output of the third convolution layer conv5_3 is the fourth feature map F4, which has a size of 128 pixels by 128 pixels and a channel number of 1.
S252, multiplying the fourth characteristic diagram F4 and the second characteristic diagram F2 channel by channel to obtain a fifth characteristic diagram F5; the fifth feature map F5 and the second feature map F2 are added channel by channel to obtain a fused feature map (refer to fig. 6), the size of the fused feature map is 128 pixels by 128 pixels, and the number of channels is 256.
As described above, according to the characteristics of the spatial attention mechanism, the weight information of the vehicle in the third feature map F3 output by the spatial attention module is reinforced and the weight information of the background is weakened. On this basis, after the third feature map F3 is processed by the convolution layers, linear rectifying layers and batch normalization layers, each element of the resulting fourth feature map F4 is kept in the range [-1,1]. F2 and F4 are then fused as F4 × F2 + F2: the result of the second feature map F2 is retained, and F2 weighted by the fourth feature map F4 (whose elements lie in [-1,1]) is additionally added, so the error between each component of the output density map and each component of the real density map is kept within a small range. Compared with outputting F2 directly, F4 × F2 + F2 has stronger generalization capability; under the constraint of the loss function, the finally output density map is closer to the real density map, and these steps improve the accuracy of vehicle counting in remote sensing scenes.
The method fully considers the residual information discarded in the feature combination process; by using a network structure composed of convolution layers, linear rectifying layers and batch normalization layers, it fuses this residual information with the original result, so that the residual information lost by the backbone network can be incorporated during learning, which further improves the accuracy of the counting result.
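The residual fusion of steps S251 and S252 can be sketched as follows; the padding values for the 3*3 convolutions are assumptions needed to keep the 128×128 spatial size.

```python
# The third convolutional neural network reduces F3 to the single-channel map F4,
# which then re-weights F2 channel by channel and is added back (F4 * F2 + F2).
import torch
import torch.nn as nn

third_cnn = nn.Sequential(
    nn.Conv2d(512, 256, 3, stride=1, padding=1), nn.ReLU(inplace=True), nn.BatchNorm2d(256),  # conv5_1, relu5-1, bn5-1
    nn.Conv2d(256, 128, 3, stride=1, padding=1), nn.ReLU(inplace=True), nn.BatchNorm2d(128),  # conv5_2, relu5-2, bn5-2
    nn.Conv2d(128, 1, 1, stride=1),                                                           # conv5_3
)

F3 = torch.randn(1, 512, 128, 128)
F2 = torch.randn(1, 256, 128, 128)
F4 = third_cnn(F3)              # (1, 1, 128, 128); broadcasts over the 256 channels of F2
F5 = F4 * F2                    # channel-by-channel multiplication
fused = F5 + F2                 # channel-by-channel addition -> fusion feature map (1, 256, 128, 128)
```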
And S260, performing feature extraction on the fusion feature map by using a fourth convolution neural network to obtain a prediction density map.
Specifically, for each remote sensing image to be trained, downsampling is performed on the fusion feature map of the remote sensing image to be trained to obtain a prediction density map, wherein the downsampling is realized by performing convolution operations on the fusion feature map using the fourth convolutional neural network.
The fourth convolutional neural network includes a first convolutional layer conv6_1, a first linear rectifying layer relu6-1, a second convolutional layer conv6_2, a second linear rectifying layer relu6-2, and a third convolutional layer conv6_3, which are sequentially connected.
The first convolution layer conv6_1 has a window size of 3*3 and a sliding step of 1, and is used for outputting 128 feature maps. The input of the first convolution layer conv6_1 is a fusion feature map with a size of 128 pixels by 128 pixels and a channel number of 256, and the output of the first convolution layer conv6_1 is a feature map with a size of 128 pixels by 128 pixels and a channel number of 128, and the feature map is used as the input of the first linear rectification layer relu 6-1.
The linear rectification processing formula of the first linear rectifying layer relu6-1 is the same as that of the linear rectifying layer of the first convolutional neural network. The output of the first linear rectifying layer relu6-1 serves as the input of the second convolution layer conv6_2.
And the window size of the convolution kernel of the second convolution layer conv6_2 is 3*3, and the sliding step size is 1, so as to output 64 feature maps. The output of the second convolution layer conv6_2 is a feature map with a size of 128 pixels by 128 pixels and a channel number of 64. This feature map serves as input for the second linear rectifying layer relu 6-2.
The linear rectification processing formula of the second linear rectification layer relu6-2 is the same as that of the linear rectification layer of the first convolutional neural network. The output of the second linear rectifying layer relu6-2 serves as an input to the third convolution layer.
For the third convolution layer conv6_3, the window size of its convolution kernel is 1*1 and the sliding step size is 1, and it is used to output 1 feature map; the output of the third convolution layer conv6_3 is a feature map with a size of 128 pixels by 128 pixels and a channel number of 1, namely the prediction density map.
It should be noted that the prediction density map of the remote sensing image to be trained includes a plurality of pixels, and each pixel in the prediction density map corresponds to a certain region of the remote sensing image to be trained. The value of a pixel in the prediction density map predicts the density of vehicles in the corresponding region of the remote sensing image to be trained, and the total number of vehicles in the whole region covered by the remote sensing image to be trained can be predicted from the sum of the values of all pixels in the prediction density map. Therefore, the density of the vehicle distribution can be intuitively predicted from the prediction density map.
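A minimal sketch of step S260 and of reading a count from the prediction density map is shown below; the padding values for the 3*3 convolutions are assumptions needed to keep the 128×128 size.

```python
# The fourth convolutional neural network maps the 256-channel fusion feature map to a
# single-channel density map; the predicted vehicle count is the sum over all pixels.
import torch
import torch.nn as nn

fourth_cnn = nn.Sequential(
    nn.Conv2d(256, 128, 3, stride=1, padding=1), nn.ReLU(inplace=True),  # conv6_1, relu6-1
    nn.Conv2d(128, 64, 3, stride=1, padding=1), nn.ReLU(inplace=True),   # conv6_2, relu6-2
    nn.Conv2d(64, 1, 1, stride=1),                                       # conv6_3
)

fused = torch.randn(1, 256, 128, 128)
density_map = fourth_cnn(fused)                 # (1, 1, 128, 128) prediction density map
predicted_count = density_map.sum().item()      # total number of vehicles in the image
```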
Specifically, in step S200, a vehicle counting model is trained according to a predicted density map and a real density map corresponding to each remote sensing image to be trained, so as to obtain a trained vehicle counting model, which specifically includes:
s200a, determining a predicted density map of a current remote sensing image to be trained according to the current remote sensing image to be trained in a remote sensing image training set;
s200b, training a vehicle counting model by using a predicted density map and a real density map corresponding to the current remote sensing image to be trained, and determining a loss function of the vehicle counting model according to an output result of the vehicle counting model;
it can be appreciated that each time the vehicle counting model is trained using the current remote sensing image to be trained, a loss function of the vehicle counting model needs to be calculated.
S200c, updating the weight parameters of all convolution layers in a feature extraction network VGG19 module, a first convolution neural network, a second convolution neural network, a space attention module, a third convolution neural network and a fourth convolution neural network and the values of scaling parameters and offset parameters in all batch normalization layers according to a loss function of a vehicle counting model;
it should be noted that in this embodiment, for all the convolution layers, the weight parameters of the convolution kernels in the convolution layers are initialized using the python random initialization function. After the training of the vehicle counting model by using the current remote sensing image to be trained is completed, the weight parameters of the convolution kernel need to be updated, and at the moment, the weight parameters of the convolution kernel are updated by using the loss function.
S200d, judging whether all remote sensing images to be trained in the remote sensing image training set are trained, if yes, turning to step 200f, and if not, continuing to step 200e;
s200e, taking the next remote sensing image to be trained in the remote sensing image training set as the current remote sensing image to be trained, and turning to the step S200a;
at this time, when determining the predicted density map of the current remote sensing image to be trained, the parameters of all convolution layers have been updated in the feature extraction network VGG19 module that produces the first feature map F1 of the current remote sensing image to be trained, the first convolutional neural network that produces the global feature map, the second convolutional neural network that produces the local feature map, the operation layers of the spatial attention module that produce the pooling query tensor Q, the pooling key tensor K and the pooling value tensor V, the third convolutional neural network used for the fusion feature map, and the fourth convolutional neural network that produces the predicted density map; the predicted density map of the current remote sensing image to be trained is therefore determined using these updated networks and modules.
And S200f, judging whether the current training round of the vehicle counting model reaches a preset training round, if so, ending, and if not, turning to the step S200a.
And if all the remote sensing images to be trained in the remote sensing image training set are trained, completing one round of training. And if the current training round does not reach the preset training round, performing next training. Each round of training must ensure that all the remote sensing images to be trained in the remote sensing image training set have been trained.
When the current training round of the vehicle counting model reaches the preset training round, the final training of the vehicle counting model is completed, and at the moment, any remote sensing image can be counted by using the vehicle counting model with the final training completed.
In this embodiment, the preset training round may be 1000.
Specifically, the calculation formula of the loss function of the vehicle counting model in step S200b is:

L = L_Bayes + L_count

wherein L is the loss function of the vehicle counting model, L_Bayes is the Bayesian loss function, and L_count is the counting error function.
bayesian loss functionThe definition is as follows:
wherein ,a Bayesian loss function corresponding to the current remote sensing image to be trained is calculated for the vehicle; n' is the total number of pixels in the predicted density map corresponding to the current remote sensing image to be trained, wherein the predicted density map corresponding to the current remote sensing image to be trained is an output result obtained by inputting the current remote sensing image to be trained into a vehicle counting model; c (C) n The position of the nth pixel point in the predicted density map corresponding to the current remote sensing image to be trained; e [ C ] n ]Vehicle at corresponding area for nth pixel point in predicted density map corresponding to current remote sensing image to be trained A density; />Is->A loss function;
The counting error function L_count is defined as follows:

L_count = | Y_pred − Y |

wherein L_count is the counting error function corresponding to the current remote sensing image to be trained, obtained for the vehicle counting model; Y_pred is the total number of vehicles predicted according to the predicted density map corresponding to the current remote sensing image to be trained, wherein the predicted density map corresponding to the current remote sensing image to be trained is the output result obtained by inputting the current remote sensing image to be trained into the vehicle counting model; Y is the total number of vehicles determined according to the real density map corresponding to the current remote sensing image to be trained.
It should be noted that, in this embodiment, each time the vehicle counting model is trained by using the current remote sensing image to be trained, a loss function of the vehicle counting model needs to be calculated.
The application also provides a vehicle counting system based on residual enhancement information, which comprises:
the remote sensing image training set acquisition module is used for acquiring a plurality of remote sensing images to be trained, and adjusting each remote sensing image to be trained into a uniform size so as to acquire a remote sensing image training set;
the vehicle counting model training module is used for determining a predicted density map corresponding to each remote sensing image to be trained in the remote sensing image training set, and training a vehicle counting model according to the predicted density map and the real density map corresponding to each remote sensing image to be trained to obtain a trained vehicle counting model;
The vehicle counting module is used for acquiring a target remote sensing image corresponding to the target area vehicle counting request after receiving the target area vehicle counting request, and determining the number of vehicles in the target area based on the trained vehicle counting model.
The vehicle counting model training module specifically comprises:
the feature extraction network VGG19 module is used for extracting features of the remote sensing image to be trained to obtain a first feature map F1;
the global feature map and local feature map extracting module is used for independently carrying out feature extraction on the first feature map F1 according to different convolutional neural networks so as to obtain a global feature map and a local feature map;
the splicing module is used for splicing the global feature map and the local feature map to obtain a second feature map F2;
the parameter determining module is used for inputting the first characteristic diagram F1 into three independent operation layers to determine a pooling query tensor Q, a pooling key tensor K and a pooling value tensor V according to the convolution characteristic diagram output by each operation layer, wherein each operation layer comprises a convolution layer conv4, a linear rectification layer relu4, a batch normalization layer bn4 and an addition layer add4, the input of the convolution layer conv4 is the first characteristic diagram F1, one output end of the convolution layer conv4 is connected to the input end of the linear rectification layer relu4, the output end of the linear rectification layer relu4 is connected to the input end of the batch normalization layer bn4, and the output end of the batch normalization layer bn4 is connected to the input end of the addition layer add4 together with the other output end of the convolution layer conv 4;
The spatial attention module, which uses the pooling query tensor Q, the pooling key tensor K and the pooling value tensor V determined by the parameter determining module, is used for performing information enhancement processing on the first feature map F1 to obtain a third feature map F3;
the feature fusion module is used for fusing the third feature map F3 of the remote sensing image to be trained with the second feature map F2 to obtain a fused feature map;
and the prediction density map determining module is used for determining a corresponding prediction density map according to the fusion feature map of the remote sensing image to be trained.
The application also provides an electronic device, comprising:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the above-described method of vehicle counting based on residual enhancement information.
The logic instructions in the memory described above may be implemented in the form of software functional units and stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method of the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-only memory (ROM), a random access memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The present application also provides a computer-readable storage medium having a computer program stored thereon; the computer program is executed by a processor to implement the above-described vehicle counting method based on residual enhancement information, the method comprising:
s100, acquiring a plurality of remote sensing images to be trained, and adjusting each remote sensing image to be trained to be of a uniform size to obtain a remote sensing image training set;
s200, determining a predicted density map corresponding to each remote sensing image to be trained in a remote sensing image training set, and training a vehicle counting model according to the predicted density map and the real density map corresponding to each remote sensing image to be trained to obtain a trained vehicle counting model;
s300, after receiving the vehicle counting request of the target area, acquiring a target remote sensing image corresponding to the vehicle counting request of the target area, determining the number of vehicles of the target area based on the trained vehicle counting model,
in step S200, determining a predicted density map corresponding to each remote sensing image to be trained in the remote sensing image training set specifically includes:
s210, performing feature extraction on a remote sensing image to be trained by using a feature extraction network VGG19 module to obtain a first feature map F1;
s220, independently extracting features of the first feature map F1 by using different convolutional neural networks to obtain a global feature map and a local feature map, and splicing the global feature map and the local feature map to obtain a second feature map F2;
S230, inputting the first feature map F1 into three independent operation layers in a spatial attention module to determine a pooling query tensor Q, a pooling key tensor K and a pooling value tensor V according to a convolution feature map output by each operation layer, wherein each operation layer comprises a convolution layer conv4, a linear rectification layer relu4, a batch normalization layer bn4 and an addition layer add4, the input of the convolution layer conv4 is the first feature map F1, one output end of the convolution layer conv4 is connected to the input end of the linear rectification layer relu4, the output end of the linear rectification layer relu4 is connected to the input end of the batch normalization layer bn4, and the output end of the batch normalization layer bn4 is connected to the input end of the addition layer add4 together with the other output end of the convolution layer conv 4;
s240, based on a self-attention mechanism of the spatial attention module, performing information enhancement processing on the first feature map F1 to obtain a third feature map F3;
s250, fusing the third feature map F3 and the second feature map F2 of the remote sensing image to be trained to obtain a fused feature map;
and S260, determining a corresponding prediction density map according to the fusion feature map of the remote sensing image to be trained.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The scheme in the embodiments of the present application can be implemented in various computer languages, such as the object-oriented programming language Java and the scripting language JavaScript.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing description of the preferred embodiments of the present disclosure is provided for the purpose of illustration only, and is not intended to limit the disclosure to the particular embodiments disclosed; on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the disclosure.

Claims (10)

1. A vehicle counting method based on residual enhancement information, comprising:
s100, acquiring a plurality of remote sensing images to be trained, and adjusting each remote sensing image to be trained to be of a uniform size to obtain a remote sensing image training set;
s200, determining a predicted density map corresponding to each remote sensing image to be trained in a remote sensing image training set, and training a vehicle counting model according to the predicted density map and the real density map corresponding to each remote sensing image to be trained to obtain a trained vehicle counting model;
S300, after receiving the vehicle counting request of the target area, acquiring a target remote sensing image corresponding to the vehicle counting request of the target area, determining the number of vehicles of the target area based on the trained vehicle counting model,
in step S200, determining a predicted density map corresponding to each remote sensing image to be trained in the remote sensing image training set specifically includes:
s210, performing feature extraction on a remote sensing image to be trained by using a feature extraction network VGG19 module to obtain a first feature map F1;
s220, independently extracting features of the first feature map F1 by using different convolutional neural networks to obtain a global feature map and a local feature map, and splicing the global feature map and the local feature map to obtain a second feature map F2;
s230, inputting the first feature map F1 into three independent operation layers in a spatial attention module to determine a pooling query tensor Q, a pooling key tensor K and a pooling value tensor V according to a convolution feature map output by each operation layer, wherein each operation layer comprises a convolution layer conv4, a linear rectification layer relu4, a batch normalization layer bn4 and an addition layer add4, the input of the convolution layer conv4 is the first feature map F1, one output end of the convolution layer conv4 is connected to the input end of the linear rectification layer relu4, the output end of the linear rectification layer relu4 is connected to the input end of the batch normalization layer bn4, and the output end of the batch normalization layer bn4 is connected to the input end of the addition layer add4 together with the other output end of the convolution layer conv 4;
S240, based on a self-attention mechanism of the spatial attention module, performing information enhancement processing on the first feature map F1 to obtain a third feature map F3;
S250, processing a third feature map F3 of the remote sensing image to be trained through a third convolutional neural network, multiplying the third feature map F3 with a second feature map F2 channel by channel to obtain a fifth feature map F5, and adding and fusing the fifth feature map F5 and the second feature map F2 channel by channel to obtain a fused feature map;
and S260, determining a corresponding prediction density map according to the fusion feature map of the remote sensing image to be trained.
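For orientation, a minimal PyTorch sketch of the data flow recited in steps S210 to S260 is given below; the class name, the use of torchvision's VGG19 as the backbone, the backbone slice point, and the four sub-module placeholders are illustrative assumptions and not part of the claim.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19


class VehicleCountingSketch(nn.Module):
    """Illustrative sketch of the data flow of steps S210-S260, not the patented implementation."""

    def __init__(self, global_branch, local_branch, spatial_attention, third_cnn):
        super().__init__()
        self.backbone = vgg19().features[:27]        # S210: VGG19 feature extraction (slice point assumed)
        self.global_branch = global_branch           # S220: global feature map branch
        self.local_branch = local_branch             # S220: local feature map branch
        self.spatial_attention = spatial_attention   # S230/S240: self-attention information enhancement
        self.third_cnn = third_cnn                   # S250: produces a single-channel weight map

    def forward(self, x):
        f1 = self.backbone(x)                                                    # first feature map F1
        f2 = torch.cat([self.global_branch(f1), self.local_branch(f1)], dim=1)   # second feature map F2
        f3 = self.spatial_attention(f1)                                          # third feature map F3
        f4 = self.third_cnn(f3)                                                  # fourth feature map F4
        f5 = f4 * f2                                                             # channel-by-channel product
        fused = f5 + f2                                                          # channel-by-channel addition
        return fused                                                             # S260: regressed into the predicted density map
```

The spatial_attention, global_branch, local_branch and third_cnn placeholders correspond to the modules detailed in claims 2 to 7.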
2. The method for counting vehicles based on residual enhancement information according to claim 1, wherein, in each operation layer of step S230,
for the convolution layer conv4, the window size of the convolution kernel is 3*3, and the sliding step length is 1, so as to output 512 feature maps;
for the linear rectifying layer relu4, the linear rectifying processing formula is as follows:
f(x) = max(0, x) ,
where x represents each element of the input feature map of the linear rectifying layer, and f(x) represents the linear rectification function;
for the batch normalization layer bn4, the batch normalization processing formulas are as follows:
x'_i = (x_i - μ) / sqrt(σ² + ε) ,
y_i = γ * x'_i + β ,
wherein N is the number of all elements in the input feature map of the batch normalization layer; x_i is the ith element of the input feature map of the batch normalization layer, with 1 ≤ i ≤ N; μ represents the mean value of x_i; σ represents the standard deviation of x_i; ε is a positive constant close to 0; x'_i is the standardized value of x_i; γ and β are respectively the scaling parameter and the offset parameter, the initial value of the scaling parameter γ being 1 and the initial value of the offset parameter β being 0; and y_i is the ith element of the output feature map of the batch normalization layer;
for the addition layer add4, it adds the output of the batch normalization layer bn4 and the output of the convolution layer conv4, and outputs the corresponding feature map.
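A minimal sketch of one such operation layer, assuming an input of 512 channels and a padding chosen so that the residual addition add4 is shape-compatible (both assumptions are not stated in the claim):

```python
import torch
import torch.nn as nn


class OperationLayer(nn.Module):
    """Sketch of one operation layer of step S230: conv4 -> relu4 -> bn4, with add4
    summing the bn4 output and the conv4 output."""

    def __init__(self, in_channels=512):
        super().__init__()
        self.conv4 = nn.Conv2d(in_channels, 512, kernel_size=3, stride=1, padding=1)
        self.relu4 = nn.ReLU()
        self.bn4 = nn.BatchNorm2d(512)  # scaling parameter initialized to 1, offset to 0 (PyTorch default)

    def forward(self, f1):
        conv_out = self.conv4(f1)                # one output end of conv4
        bn_out = self.bn4(self.relu4(conv_out))  # conv4 -> relu4 -> bn4
        return bn_out + conv_out                 # add4: sum with the other output end of conv4
```

Three such layers act on F1 in parallel, and the pooling query, key and value tensors Q, K and V are then derived from their output feature maps.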
3. The vehicle counting method based on residual enhancement information according to claim 1 or 2, wherein step S250 specifically includes:
S251, processing the third feature map F3 by using the third convolutional neural network to obtain a fourth feature map F4, wherein the third convolutional neural network comprises a first convolution layer conv5_1, a first linear rectifying layer relu5-1, a first batch normalization layer bn5-1, a second convolution layer conv5_2, a second linear rectifying layer relu5-2, a second batch normalization layer bn5-2 and a third convolution layer conv5_3 which are sequentially connected;
S252, multiplying the fourth feature map F4 and the second feature map F2 channel by channel to obtain a fifth feature map F5, and adding the fifth feature map F5 and the second feature map F2 channel by channel to obtain the fused feature map.
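A short sketch of the fusion of step S252, assuming F4 is the single-channel output of the third convolutional neural network so that the channel-by-channel product broadcasts over F2 (shapes illustrative only):

```python
import torch


def residual_fusion(f2: torch.Tensor, f4: torch.Tensor) -> torch.Tensor:
    """Sketch of step S252: F5 = F4 * F2 channel by channel, then fused = F5 + F2."""
    f5 = f4 * f2
    return f5 + f2


# Illustrative shapes only.
f2 = torch.randn(1, 256, 128, 128)  # second feature map F2 (global and local maps concatenated)
f4 = torch.rand(1, 1, 128, 128)     # fourth feature map F4 (single-channel weight map)
fused = residual_fusion(f2, f4)
```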
4. The vehicle counting method based on residual enhancement information of claim 3, wherein, in the third convolutional neural network,
for the first convolution layer conv5_1, the window size of the convolution kernel is 3*3, and the sliding step length is 1, so as to output 256 feature maps;
for the second convolution layer conv5_2, the window size of the convolution kernel is 3*3, and the sliding step length is 1, so as to output 128 feature maps;
for the third convolution layer conv5_3, the window size of the convolution kernel is 1*1, and the sliding step length is 1, so as to output 1 feature map;
for the first linear rectifying layer relu5-1 and the second linear rectifying layer relu5-2, the linear rectifying processing formula is the same as the linear rectifying processing formula of the linear rectifying layer relu4 of the operation layer;
for the first batch normalization layer bn5-1 and the second batch normalization layer bn5-2, the batch normalization formula is the same as the batch normalization formula of the batch normalization layer bn4 of the operation layer, the initial value of the scaling parameter γ is 1, and the initial value of the offset parameter β is 0.
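Under the layer parameters of claim 4, the third convolutional neural network can be sketched as follows; the input channel count and the padding values are assumptions:

```python
import torch.nn as nn


def build_third_cnn(in_channels=512):
    """Sketch of the third convolutional neural network of claims 3-4."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 256, kernel_size=3, stride=1, padding=1),  # conv5_1: 3*3, stride 1, 256 maps
        nn.ReLU(),                                                        # relu5-1
        nn.BatchNorm2d(256),                                              # bn5-1
        nn.Conv2d(256, 128, kernel_size=3, stride=1, padding=1),          # conv5_2: 3*3, stride 1, 128 maps
        nn.ReLU(),                                                        # relu5-2
        nn.BatchNorm2d(128),                                              # bn5-2
        nn.Conv2d(128, 1, kernel_size=1, stride=1),                       # conv5_3: 1*1, stride 1, 1 map
    )
```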
5. The method for counting vehicles based on residual enhancement information as claimed in claim 3, wherein the step S220 specifically comprises:
S221, performing feature extraction on the first feature map F1 by using a first convolutional neural network to obtain a global feature map, wherein the first convolutional neural network comprises a first convolution layer conv2_1, a first batch normalization layer bn2-1, a first linear rectifying layer relu2-1, a second convolution layer conv2_2, a third convolution layer conv2_3, a fourth convolution layer conv2_4, a fifth convolution layer conv2_5, a second batch normalization layer bn2-2, a second linear rectifying layer relu2-2, a sixth convolution layer conv2_6, a third batch normalization layer bn2-3 and a third linear rectifying layer relu2-3 which are sequentially connected;
S222, performing feature extraction on the first feature map F1 by using a second convolutional neural network to obtain a local feature map, wherein the second convolutional neural network comprises an adaptive average pooling layer adp_avg_pooling3, a first convolution layer conv3_1, a first batch normalization layer bn3-1, a first linear rectifying layer relu3-1, a second convolution layer conv3_2, a third convolution layer conv3_3, a fourth convolution layer conv3_4, a fifth convolution layer conv3_5, a second batch normalization layer bn3-2, a second linear rectifying layer relu3-2, a sixth convolution layer conv3_6, a third batch normalization layer bn3-3, a third linear rectifying layer relu3-3 and an up-sampling layer interpolate3 which are sequentially connected;
S223, calling a cat function to splice the global feature map and the local feature map in the channel dimension to obtain the second feature map F2.
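The splicing of step S223 corresponds directly to torch.cat along the channel dimension; the tensor shapes below are illustrative only, with each branch outputting 128 feature maps per claims 6 and 7:

```python
import torch

# Sketch of step S223: splice the global and local feature maps along the channel
# dimension to form the second feature map F2.
global_features = torch.randn(1, 128, 128, 128)  # output of the first convolutional neural network
local_features = torch.randn(1, 128, 128, 128)   # output of the second convolutional neural network
f2 = torch.cat([global_features, local_features], dim=1)  # 256-channel second feature map F2
```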
6. The method for vehicle counting based on residual enhancement information of claim 5, wherein, in the first convolutional neural network,
for the first convolution layer conv2_1, the window size of the convolution kernel is 1*1, and the sliding step length is 1, so as to output 128 feature maps;
for the second convolution layer conv2_2, the window size of the convolution kernel is 3*3, and the sliding step length is 1, so as to output 128 feature maps;
for the third convolution layer conv2_3, the window size of the convolution kernel is 5*5, and the sliding step length is 1, so as to output 128 feature maps;
for the fourth convolution layer conv2_4, the window size of the convolution kernel is 3*3, and the sliding step length is 1, so as to output 128 feature maps;
for the fifth convolution layer conv2_5, the window size of the convolution kernel is 5*5, and the sliding step length is 1, so as to output 128 feature maps;
for the sixth convolution layer conv2_6, the window size of the convolution kernel is 1*1, and the sliding step length is 1, so as to output 128 feature maps;
for the first batch normalization layer bn2-1, the second batch normalization layer bn2-2 and the third batch normalization layer bn2-3, the batch normalization formula is the same as the batch normalization formula of the batch normalization layer bn4 of the operation layer, the initial value of the scaling parameter gamma is 1, and the initial value of the offset parameter beta is 0;
for the first linear rectifying layer relu2-1, the second linear rectifying layer relu2-2 and the third linear rectifying layer relu2-3, the linear rectifying processing formula is the same as the linear rectifying processing formula of the linear rectifying layer relu4 of the operation layer.
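A sketch of the first convolutional neural network under the parameters of claim 6; the input channel count and the padding values are assumptions, while the layer order, kernel sizes, strides and channel counts follow the claim:

```python
import torch.nn as nn


def build_global_branch(in_channels=512):
    """Sketch of the first convolutional neural network of claims 5-6 (global feature map branch)."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 128, kernel_size=1, stride=1),     # conv2_1
        nn.BatchNorm2d(128),                                      # bn2-1
        nn.ReLU(),                                                # relu2-1
        nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1),  # conv2_2
        nn.Conv2d(128, 128, kernel_size=5, stride=1, padding=2),  # conv2_3
        nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1),  # conv2_4
        nn.Conv2d(128, 128, kernel_size=5, stride=1, padding=2),  # conv2_5
        nn.BatchNorm2d(128),                                      # bn2-2
        nn.ReLU(),                                                # relu2-2
        nn.Conv2d(128, 128, kernel_size=1, stride=1),             # conv2_6
        nn.BatchNorm2d(128),                                      # bn2-3
        nn.ReLU(),                                                # relu2-3
    )
```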
7. The vehicle counting method based on residual enhancement information of claim 5, wherein, in the second convolutional neural network,
For the adaptive average pooling layer adp_avg_pooling3, the adaptive average pooling formula is:
Output[i,j,c] = 1/(pool_H*pool_W) * ΣInput[m,n,c] ,
where the summation runs over the pool_H*pool_W window of the input feature map corresponding to output position (i, j); Output[i,j,c] represents the value at row i, column j and channel c of the output feature map of the adaptive average pooling layer; Input[m,n,c] represents the value at row m, column n and channel c of the input feature map of the adaptive average pooling layer; pool_H and pool_W are calculated by the following formulas:
pool_H=floor(H/output_H) ;
pool_W=floor(W/output_W) ;
wherein floor(x) represents rounding x down; H is the length of the input feature map, and W is the width of the input feature map; output_H is the length of the output feature map, and output_H is 9; output_W is the width of the output feature map, and output_W is 9, wherein the length of a feature map is the maximum value of its row number, and the width of a feature map is the maximum value of its column number;
for the first convolution layer conv3_1, the window size of the convolution kernel is 1*1, and the sliding step length is 1, so as to output 128 feature maps;
for the second convolution layer conv3_2, the window size of the convolution kernel is 3*3, and the sliding step length is 1, so as to output 128 feature maps;
for the third convolution layer conv3_3, the window size of the convolution kernel is 5*5, and the sliding step length is 1, so as to output 128 feature maps;
for the fourth convolution layer conv3_4, the window size of the convolution kernel is 3*3, and the sliding step length is 1, so as to output 128 feature maps;
for the fifth convolution layer conv3_5, the window size of the convolution kernel is 5*5, and the sliding step length is 1, so as to output 128 feature maps;
for the sixth convolution layer conv3_6, the window size of the convolution kernel is 1*1, and the sliding step length is 1, so as to output 128 feature maps;
for the first batch normalization layer bn3-1, the second batch normalization layer bn3-2 and the third batch normalization layer bn3-3, the batch normalization formula is the same as the batch normalization formula of the batch normalization layer bn4 of the operation layer, the initial value of the scaling parameter gamma is 1, and the initial value of the offset parameter beta is 0;
for the first linear rectifying layer relu3-1, the second linear rectifying layer relu3-2 and the third linear rectifying layer relu3-3, the linear rectifying processing formula is the same as the linear rectifying processing formula of the linear rectifying layer relu4 of the operation layer;
for the up-sampling layer interpolate3, the processing is performed by the PyTorch library function torch.nn.functional.interpolate(input, size, mode), wherein input is the feature map output by the previous layer; size is the specified output size, namely 128 pixels by 128 pixels; and mode is the interpolation algorithm used, namely the bilinear interpolation algorithm.
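The pooling and up-sampling steps of claim 7 can be sketched with the corresponding PyTorch functionals; note that adaptive_avg_pool2d chooses its window boundaries automatically and matches the floor-based pool_H/pool_W kernel of the claim exactly only when the input size is divisible by 9, and the intermediate convolution layers are omitted here:

```python
import torch
import torch.nn.functional as F

# Sketch of the pooling and up-sampling stages of the second convolutional neural network.
x = torch.randn(1, 512, 128, 128)                       # first feature map F1 (shape illustrative)
pooled = F.adaptive_avg_pool2d(x, output_size=(9, 9))   # adp_avg_pooling3: 9x9 output feature map
restored = F.interpolate(pooled, size=(128, 128), mode='bilinear', align_corners=False)  # interpolate3
```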
8. The method for vehicle counting based on residual enhancement information according to claim 5, wherein, in step S200, training the vehicle counting model according to the predicted density map and the real density map corresponding to each remote sensing image to be trained specifically comprises:
S200a, determining a predicted density map of a current remote sensing image to be trained according to the current remote sensing image to be trained in the remote sensing image training set;
S200b, training the vehicle counting model by using the predicted density map and the real density map corresponding to the current remote sensing image to be trained, and determining a loss function of the vehicle counting model according to an output result of the vehicle counting model;
S200c, updating the weight parameters of the convolution kernels in all convolution layers and the values of the scaling parameters and the offset parameters in all batch normalization layers according to the loss function of the vehicle counting model;
S200d, judging whether all remote sensing images to be trained in the remote sensing image training set have been trained; if yes, turning to step S200f, and if not, continuing to step S200e;
S200e, taking the next remote sensing image to be trained in the remote sensing image training set as the current remote sensing image to be trained, and turning to step S200a;
S200f, judging whether the current training round of the vehicle counting model reaches the preset number of training rounds; if so, ending; if not, turning to step S200a,
wherein the loss function of the vehicle counting model is calculated from a Bayesian loss function L_Bayes and a counting error function L_count,
and the Bayesian loss function L_Bayes is defined with reference to the following quantities:
L_Bayes is the Bayesian loss function calculated by the vehicle counting model for the current remote sensing image to be trained; N' is the total number of pixels in the predicted density map corresponding to the current remote sensing image to be trained, the predicted density map being the output result obtained by inputting the current remote sensing image to be trained into the vehicle counting model; C_n is the position of the nth pixel point in the predicted density map corresponding to the current remote sensing image to be trained; E[C_n] is the vehicle density at the area corresponding to the nth pixel point in the predicted density map corresponding to the current remote sensing image to be trained; and ℓ(·) denotes the loss function used within the Bayesian loss;
and the counting error function L_count is defined with reference to the following quantities:
L_count is the counting error function obtained by the vehicle counting model for the current remote sensing image to be trained; F(X) is the total number of vehicles predicted according to the predicted density map corresponding to the current remote sensing image to be trained, the predicted density map being the output result obtained by inputting the current remote sensing image to be trained into the vehicle counting model; and Y is the total number of vehicles determined according to the real density map corresponding to the current remote sensing image to be trained.
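A sketch of the training procedure of steps S200a to S200f follows; the optimizer, the learning rate and the exact combination of the Bayesian loss and counting error terms (whose formulas are only partially reproduced in the published text) are assumptions, so loss_fn is left as an abstract callable:

```python
import torch


def train_vehicle_counter(model, images, gt_densities, loss_fn, rounds, lr=1e-4):
    """Sketch of the training procedure of steps S200a-S200f; loss_fn stands for the
    claimed combination of the Bayesian loss L_Bayes and the counting error L_count."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(rounds):                                   # S200f: preset number of training rounds
        for image, gt_density in zip(images, gt_densities):   # S200d/S200e: every image in the training set
            pred_density = model(image)                       # S200a: predicted density map
            loss = loss_fn(pred_density, gt_density)          # S200b: loss from predicted and real density maps
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                                  # S200c: update conv kernels and gamma/beta
    return model
```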
9. An electronic device, comprising:
a memory;
a processor; and
a computer program,
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the residual enhancement information based vehicle counting method of any one of claims 1 to 8.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored therein a computer program that is executed by a processor to implement the residual enhancement information-based vehicle counting method of any one of claims 1 to 8.
CN202310711220.4A 2023-06-15 2023-06-15 Vehicle counting method based on residual information enhancement, electronic device and readable medium Active CN116433675B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310711220.4A CN116433675B (en) 2023-06-15 2023-06-15 Vehicle counting method based on residual information enhancement, electronic device and readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310711220.4A CN116433675B (en) 2023-06-15 2023-06-15 Vehicle counting method based on residual information enhancement, electronic device and readable medium

Publications (2)

Publication Number Publication Date
CN116433675A CN116433675A (en) 2023-07-14
CN116433675B true CN116433675B (en) 2023-08-15

Family

ID=87084133

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310711220.4A Active CN116433675B (en) 2023-06-15 2023-06-15 Vehicle counting method based on residual information enhancement, electronic device and readable medium

Country Status (1)

Country Link
CN (1) CN116433675B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188685A (en) * 2019-05-30 2019-08-30 燕山大学 A kind of object count method and system based on the multiple dimensioned cascade network of double attentions
CN112084868A (en) * 2020-08-10 2020-12-15 北京航空航天大学 Target counting method in remote sensing image based on attention mechanism
CN114120361A (en) * 2021-11-19 2022-03-01 西南交通大学 Crowd counting and positioning method based on coding and decoding structure
CN115775376A (en) * 2022-12-28 2023-03-10 广东工业大学 Crowd counting method based on low-light image enhancement
CN116188799A (en) * 2023-02-15 2023-05-30 南昌大学 Intensive vehicle counting method based on deep space-time network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11048948B2 (en) * 2019-06-10 2021-06-29 City University Of Hong Kong System and method for counting objects


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Object Counting Networks Based on Multi-Scale and Attention Mechanisms; Meng Xue; China Master's Theses Full-text Database, Information Science and Technology Series (No. 03); pp. I138-1894 *

Also Published As

Publication number Publication date
CN116433675A (en) 2023-07-14

Similar Documents

Publication Publication Date Title
CN110188705B (en) Remote traffic sign detection and identification method suitable for vehicle-mounted system
CN111062413B (en) Road target detection method and device, electronic equipment and storage medium
CN111598030B (en) Method and system for detecting and segmenting vehicle in aerial image
CN113362329B (en) Method for training focus detection model and method for recognizing focus in image
EP4152204A1 (en) Lane line detection method, and related apparatus
CN112257609B (en) Vehicle detection method and device based on self-adaptive key point heat map
CN112528878A (en) Method and device for detecting lane line, terminal device and readable storage medium
WO2022134996A1 (en) Lane line detection method based on deep learning, and apparatus
CN111461213B (en) Training method of target detection model and target rapid detection method
CN110246148B (en) Multi-modal significance detection method for depth information fusion and attention learning
CN106780727B (en) Vehicle head detection model reconstruction method and device
CN111612489A (en) Order quantity prediction method and device and electronic equipment
CN113947766B (en) Real-time license plate detection method based on convolutional neural network
CN111738113A (en) Road extraction method of high-resolution remote sensing image based on double-attention machine system and semantic constraint
CN112861619A (en) Model training method, lane line detection method, equipment and device
CN115631344B (en) Target detection method based on feature self-adaptive aggregation
CN114913498A (en) Parallel multi-scale feature aggregation lane line detection method based on key point estimation
CN113743417A (en) Semantic segmentation method and semantic segmentation device
CN114332921A (en) Pedestrian detection method based on improved clustering algorithm for Faster R-CNN network
CN114067142A (en) Method for realizing scene structure prediction, target detection and lane level positioning
CN113989287A (en) Urban road remote sensing image segmentation method and device, electronic equipment and storage medium
CN116433675B (en) Vehicle counting method based on residual information enhancement, electronic device and readable medium
CN113269156A (en) Signal lamp detection and identification method and system based on multi-scale feature fusion
CN112365451A (en) Method, device and equipment for determining image quality grade and computer readable medium
CN115376094B (en) Scale-perception neural network-based road surface identification method and system for unmanned sweeper

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant