CN114154620A - Training method of crowd counting network - Google Patents

Training method of crowd counting network

Info

Publication number
CN114154620A
CN114154620A
Authority
CN
China
Prior art keywords
convolution
network
scale
output
input
Prior art date
Legal status
Granted
Application number
CN202111449140.3A
Other languages
Chinese (zh)
Other versions
CN114154620B (en)
Inventor
赵怀林
梁兰军
周方波
Current Assignee
Shanghai Institute of Technology
Original Assignee
Shanghai Institute of Technology
Priority date
Filing date
Publication date
Application filed by Shanghai Institute of Technology filed Critical Shanghai Institute of Technology
Priority to CN202111449140.3A
Publication of CN114154620A
Application granted
Publication of CN114154620B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a training method for a crowd counting network. The method effectively extracts multi-scale information through several groups of pyramid convolution kernels with different void rates, addressing the problem of non-uniform head sizes. Adding batch normalization to each layer's output eases the training difficulty caused by increased network depth, while a residual structure further deepens the network without increasing the parameter count, yielding higher robustness.

Description

Training method of crowd counting network
Technical Field
The invention relates to a training method of a crowd counting network.
Background
People counting and people positioning are important tasks in current computer vision. In practice, however, variable shooting angles make the head sizes in a picture non-uniform, and high-density scenes exhibit severe occlusion and uneven crowd distribution, all of which increase the difficulty of crowd counting and crowd positioning. Convolutional neural networks provide a better approach to both tasks, and the network is usually desired to be as deep as possible to better map the input-output relationship; but as the depth increases, the number of parameters grows, making the network difficult to train and even causing gradients to explode or vanish.
Disclosure of Invention
The invention aims to provide a training method of a crowd counting network.
In order to solve the above problems, the present invention provides a training method for a crowd counting network, comprising:
Step 1, when crowd counting is performed, the method comprises the following steps:
Step S1-1, the front end of the network's encoder adopts the first ten layers of VGG16_bn; the sample pictures are input to the encoder front end, and the feature information of the pictures is extracted;
Step S1-2, the picture feature information extracted at the encoder front end is sent to the encoder back end, which adopts five residual network structures with multi-scale cavity pyramid convolution aggregation modules to increase the depth of the network and extract multi-scale feature information;
Step S1-3, the extracted multi-scale feature information is sent to a crowd counting decoder for three rounds of up-sampling, finally outputting a single-channel estimated density map;
Step S1-4, a first loss function is calculated from the single-channel estimated density map, the network is optimized according to the first loss function, and the evaluation index of crowd counting is calculated.
Further, in the above method, before step S1-3, the method further comprises generating the crowd counting decoder, including:
the crowd counting decoder first adopts a 1×1 convolution kernel to output a density map, then performs three transposed convolutions, and outputs an estimated density map with the same size as the input picture.
Further, in the above method, the first loss function in step S1-4 adopts the L2 loss, expressed as:
L_2 = (1/(2N)) Σ_{i=1}^{N} ‖M̂_i − M_i‖²,
where N is the number of pictures used for batch training, M_i is the ground-truth density map, and M̂_i is the density map estimated by the network.
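This loss can be sketched numerically. A minimal NumPy version, assuming the common 1/(2N) normalization (the patent's exact formula appears only as an image in the original document):

```python
import numpy as np

def l2_loss(est_maps, gt_maps):
    """L2 loss between estimated and ground-truth density maps.

    est_maps, gt_maps: arrays of shape (N, H, W); N is the batch size.
    """
    n = est_maps.shape[0]
    diff = est_maps - gt_maps
    # (1 / 2N) * sum of squared differences over all pixels of the batch
    return np.sum(diff ** 2) / (2 * n)

# Toy batch of two 4x4 density maps, off by 0.1 everywhere.
gt = np.zeros((2, 4, 4))
gt[0, 1, 1] = 1.0
est = gt + 0.1
loss = l2_loss(est, gt)
print(loss)  # 0.08 = (32 pixels * 0.01) / (2 * 2)
```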
Further, the method further comprises a step two; when crowd positioning is performed, the method comprises the following steps:
Step S2-1, the front end of the network's encoder adopts the first ten layers of VGG16_bn; a sample picture is input to the front end, and the feature information of the picture is extracted;
Step S2-2, the extracted picture feature information is sent to the encoder back end, which adopts five residual network structures with multi-scale cavity pyramid convolution aggregation modules to increase the depth of the network and extract multi-scale feature information;
Step S2-3, the extracted multi-scale feature information is sent to a crowd positioning decoder, which outputs a head pixel map and a background map, the number of channels being 2;
Step S2-4, a second loss function is calculated from the head pixel map and the background map, the network is optimized according to the second loss function, and the evaluation index of crowd positioning is calculated.
Further, in the above method, the second loss function in step S2-4 adopts the cross-entropy loss, expressed as:
L_loc = −(1/N) Σ_{j=1}^{N} Σ_{p=1}^{m×n} [γ · ψ(X_p) · log Ŷ(X_p) + (1 − ψ(X_p)) · log(1 − Ŷ(X_p))],
where j refers to the j-th picture of the batch input, N is the number of pictures in the batch, p refers to the p-th pixel of each picture, m×n is the pixel size of each picture, γ is used to increase the weight at head points, Ŷ(X_p) is the prediction label generated for the p-th pixel of the j-th picture by the crowd positioning network and takes values 0 and 1, and ψ(X_p) is the ground-truth map of the dataset.
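One plausible numerical reading of this loss is sketched below in NumPy. The exact role of γ and the precise formula are assumptions, since the patent's equation appears only as an image; here γ simply up-weights the head-point term of a standard pixel-wise cross-entropy.

```python
import numpy as np

def localization_loss(pred, gt, gamma=2.0, eps=1e-7):
    """Weighted pixel-wise cross-entropy for crowd localization (sketch).

    pred: predicted head probabilities, shape (N, m, n), values in (0, 1).
    gt:   ground-truth head map psi, shape (N, m, n), values 0 or 1.
    gamma: assumed extra weight on head-point pixels.
    """
    n = pred.shape[0]
    pred = np.clip(pred, eps, 1 - eps)  # numerical safety for log()
    head_term = gamma * gt * np.log(pred)
    bg_term = (1 - gt) * np.log(1 - pred)
    return -np.sum(head_term + bg_term) / n

# Toy 2x2 picture with one head point at (0, 0).
gt = np.zeros((1, 2, 2))
gt[0, 0, 0] = 1.0
pred = np.full((1, 2, 2), 0.1)
pred[0, 0, 0] = 0.9
loss = localization_loss(pred, gt)
```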
Further, in the above method, before step S2-3, the method further comprises generating the crowd positioning decoder, including:
the crowd positioning decoder first adjusts the output size to 1/4 of the input image through one transposed convolution, then outputs a head pixel map and a background map through a 1×1 convolution kernel, and finally outputs a positioning map with the same size as the input image through two bilinear interpolation adjustments.
Further, in the above method, in step S1-2 or S2-2, sending the extracted picture feature information to the encoder back end, which adopts five residual network structures with multi-scale cavity pyramid convolution aggregation modules to increase the depth of the network and extract multi-scale feature information, includes:
first, the picture feature information extracted in the previous step is input to a multi-scale cavity pyramid convolution aggregation module; the module first groups the feature maps, and each group adopts 3×3 convolution kernels with different void rates to extract multi-scale information, the number of channels in each group being 128. The void convolution enlarges the receptive field without increasing the number of parameters; a convolution kernel with a given void rate is equivalent to a new, larger convolution kernel, and the equivalent kernel size is expressed as:
K=k+(k-1)×(d-1),
where K is the equivalent convolution kernel size, k is the void convolution kernel size, and d is the void rate of the void convolution kernel; by calculation, a 3×3 convolution with d = 2 corresponds to a 5×5 convolution kernel, a 3×3 convolution kernel with d = 3 corresponds to a 7×7 convolution kernel, and a 3×3 convolution kernel with d = 4 corresponds to a 9×9 convolution kernel;
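The equivalent-kernel relation is easy to verify directly; the three worked cases above follow from K = k + (k − 1)(d − 1):

```python
def equivalent_kernel(k, d):
    """Equivalent kernel size of a k x k convolution with void rate d:
    K = k + (k - 1) * (d - 1).
    """
    return k + (k - 1) * (d - 1)

# The three cases used by the module's pyramid groups:
assert equivalent_kernel(3, 2) == 5
assert equivalent_kernel(3, 3) == 7
assert equivalent_kernel(3, 4) == 9
```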
then the multi-scale context information extracted by the different groups of convolution kernels is aggregated, and the feature information is processed by batch normalization and finally sent to a ReLU activation function, expressed as:
M^i = R_relu(F_bn(C(P^i_1, P^i_2, …, P^i_n))),
where P^i_n denotes the output feature map of the n-th group of convolution kernels of the i-th multi-scale cavity pyramid convolution aggregation module, C(·) denotes aggregation of the extracted multi-scale information, F_bn denotes batch normalization, and R_relu denotes sending the feature information to the ReLU activation function.
Further, in the above method, among the five multi-scale cavity pyramid convolution aggregation modules of step S1-2 or S2-2,
the first three modules divide the feature information of the input picture into four groups and convolve them with 3×3 convolution kernels with void rates d = 1, 2, 3, 4 respectively, the number of output channels being 512;
the fourth module divides the input feature map into two groups with void rates d = 1, 2, the number of output channels being 256;
the fifth module adopts only one group of 3×3 convolution kernels with void rate 1, the number of output channels being 128.
Further, in the above method, before step S1-2 or S2-2, generating the residual network structure with the multi-scale cavity pyramid convolution aggregation module includes:
first, in order to increase the network depth, a 1×1 convolution layer is added at each end of the multi-scale cavity pyramid convolution module, followed by batch normalization and input to a ReLU activation function; naming this structure a stack layer, the output mapping of the stack layer is expressed as:
F(X)=δ2(Φ(δ1(X))),
where X, the output of the previous layer, is the input of the residual network, δi(·) denotes the output mapping of a 1×1 convolution kernel followed by batch normalization and a ReLU activation function, and Φ(δ1(X)) denotes the output mapping obtained by the multi-scale cavity pyramid convolution aggregation module with δ1(X) as input;
then the structure of the residual network is selected, in a switch-like manner, according to the relation between the number of output channels and the number of input channels of the stack layer.
Further, in the above method, selecting the structure of the residual network in a switch-like manner according to the relation between the number of output channels and the number of input channels of the stack layer includes:
for the first three multi-scale cavity pyramid convolution aggregation modules, the number of input channels equals the number of output channels, and the network directly adds the output of the stack layer to its input to obtain the output, defined as:
Y=F(X)+X,
where Y denotes the output of the residual network;
for the fourth and fifth multi-scale cavity pyramid convolution aggregation modules, the number of input channels does not equal the number of output channels; the network first sends the input to a 1×1 convolution kernel to change the number of channels and then adds the result to the output of the stack layer, defined as:
Y=F(X)+δ3(X),
where Y denotes the output of the residual network, and δ3(X) denotes the output mapping of the input feature map through a 1×1 convolution kernel, batch normalization and a ReLU activation function.
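The switch-like choice between the two residual forms can be sketched as follows. This is a NumPy illustration in which the 1×1 convolution δ3 is modeled as a per-pixel channel-mixing matrix (a simplification; the patent's δ3 also includes batch normalization and ReLU):

```python
import numpy as np

def residual_output(f_x, x, project=None):
    """Switch-like residual connection.

    f_x: stack-layer output F(X), shape (C_out, H, W).
    x:   stack-layer input X, shape (C_in, H, W).
    project: optional callable standing in for delta_3 (a 1x1
             convolution in the patent, modeled here as a matrix).
    """
    if f_x.shape == x.shape:   # C_in == C_out: Y = F(X) + X
        return f_x + x
    return f_x + project(x)    # C_in != C_out: Y = F(X) + delta_3(X)

# A "1x1 convolution" mapping 4 channels to 2 is just a (2, 4) matrix
# applied independently at every pixel.
w = np.ones((2, 4)) / 4.0
proj = lambda x: np.einsum("oc,chw->ohw", w, x)

x = np.ones((4, 3, 3))
f_same = np.zeros((4, 3, 3))
f_diff = np.zeros((2, 3, 3))
assert residual_output(f_same, x).shape == (4, 3, 3)
assert residual_output(f_diff, x, proj).shape == (2, 3, 3)
```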
Compared with the prior art, the scale-aware crowd counting network with pyramid void convolution provided by the invention increases the network depth and extracts multi-scale information. The specific steps for crowd counting are as follows: a picture is input; the feature information of the picture is first extracted by the encoder front end VGG16_bn, and the extracted feature map is then input in sequence to five residual networks with cavity pyramid convolution to extract multi-scale information. The network structure first passes through a 1×1 convolution kernel, with batch normalization and input to an activation function. The extracted feature maps are then divided into several groups; in the multi-scale cavity pyramid convolution aggregation module, several groups of pyramid convolutions with different void rates extract multi-scale information respectively, and the feature information is aggregated, batch-normalized and input to an activation function. The first three modules divide the input feature map into four groups with void rates d = 1, 2, 3, 4; the fourth module divides the input feature map into two groups with void rates d = 1, 2; and the fifth module adopts only one group of 3×3 convolution kernels with void rate 1. After a module outputs its feature information, the input and output are added through a residual connection to obtain the output mapping of the residual network. Finally, the extracted multi-scale information is input to a decoder for up-sampling, and an estimated density map is output.
The specific steps for crowd positioning are as follows: the picture is first input to the network's encoder, whose structure is the same as that for crowd counting. The decoder then adjusts the output size to 1/4 of the input image through one transposed convolution, outputs a head pixel map and a background map through a 1×1 convolution kernel, and finally outputs a positioning map with the same size as the input image through two bilinear interpolation adjustments.
The method provided by the invention effectively extracts multi-scale information through several groups of pyramid convolution kernels with different void rates, solving the problem of non-uniform head sizes. Adding batch normalization to each layer's output eases the training difficulty caused by increased network depth, while a residual structure further deepens the network without increasing the parameter count, yielding higher robustness.
Drawings
FIG. 1 is a network architecture diagram of population count according to one embodiment of the present invention;
FIG. 2 is a diagram of a network architecture for crowd positioning according to an embodiment of the present invention;
FIG. 3 is a block diagram of a multi-scale void pyramid convolution aggregation module according to an embodiment of the present invention;
FIG. 4 is a residual network structure of the multi-scale cavity pyramid convolution aggregation module according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
In the face of the problems of low network depth and multiple scale changes in the prior art, the invention aims to design a method capable of extracting multi-scale information and increasing the network depth.
The invention provides a training method for a crowd counting network, comprising: S1, when crowd counting is performed, with the network framework shown in FIG. 1, the following steps:
S1-1, the front end of the network's encoder adopts the first ten layers of VGG16_bn; the sample pictures are input to the encoder front end, and the feature information of the pictures is extracted.
S1-2, the feature map extracted at the front end is sent to the encoder back end, where five residual network structures with multi-scale cavity pyramid convolution aggregation modules are adopted to increase the depth of the network and extract multi-scale information;
S1-3, the extracted multi-scale feature information is sent to a decoder for three rounds of up-sampling, finally outputting a single-channel estimated density map;
S1-4, a loss function is calculated from the estimated density map, and the network is optimized.
S2, when crowd positioning is performed, with the network framework shown in FIG. 2, the following steps:
S2-1, the sample picture is input to the front end of an encoder with the same structure as for crowd counting, and the feature information of the picture is extracted;
S2-2, the extracted feature map is sent to the encoder back end, and multi-scale feature information is extracted;
S2-3, the extracted feature information is sent to a decoder, which outputs a head pixel map and a background map, the number of channels being 2;
S2-4, the network is optimized according to the loss function.
The multi-scale cavity pyramid convolution aggregation module in steps S1-2 and S2-2 of crowd counting and crowd positioning operates as follows:
first, the feature map extracted in the previous step is input to the multi-scale cavity convolution aggregation module; as shown in FIG. 3, C denotes aggregation, BR denotes batch normalization and ReLU, and MDC denotes the multi-scale cavity pyramid convolution aggregation module. The module first groups the feature maps; each group adopts 3×3 convolution kernels with different void rates to extract multi-scale information, and the number of channels in each group is 128. The void convolution can enlarge the receptive field without increasing the number of parameters; convolution kernels with different void rates are equivalent to new, larger convolution kernels, and the equivalent kernel size can be expressed as:
K=k+(k-1)×(d-1) (1)
where K is the equivalent convolution kernel size, k is the void convolution kernel size, and d is the void rate of the void convolution kernel. By calculation, a 3×3 convolution with d = 2 corresponds to a 5×5 convolution kernel, a 3×3 convolution kernel with d = 3 corresponds to a 7×7 convolution kernel, and a 3×3 convolution kernel with d = 4 corresponds to a 9×9 convolution kernel.
The multi-scale context information extracted by the different groups of convolution kernels is then aggregated, and the feature information is processed by batch normalization and finally sent to a ReLU activation function, which can be expressed as:
M^i = R_relu(F_bn(C(P^i_1, P^i_2, …, P^i_n))) (2)
where P^i_n denotes the output feature map of the n-th group of convolution kernels of the i-th multi-scale cavity pyramid convolution aggregation module, C(·) denotes aggregation of the extracted multi-scale information, F_bn denotes batch normalization, and R_relu denotes sending the feature information to the ReLU activation function.
The scale-aware crowd counting network with pyramid void convolution adopts five multi-scale cavity pyramid convolution aggregation modules in total. The first three modules divide the input feature map into four groups, convolved with 3×3 convolution kernels with void rates d = 1, 2, 3, 4 respectively; the number of output channels is 512. The fourth module divides the input feature map into two groups with void rates d = 1, 2; the number of output channels is 256. The fifth module adopts only one group of 3×3 convolution kernels with void rate 1; the number of output channels is 128.
Further, the residual network structure with the multi-scale cavity pyramid convolution aggregation module in steps S1-2 and S2-2, shown in FIG. 4, operates as follows:
first, in order to increase the network depth, a 1×1 convolution layer is added at each end of the multi-scale cavity pyramid convolution module, followed by batch normalization and input to a ReLU activation function. The invention names this structure a stack layer, and the output mapping of the stack layer can be expressed as:
F(X)=δ2(Φ(δ1(X))) (3)
where X, the output of the previous layer, is the input of the residual network, δi(·) denotes the output mapping of a 1×1 convolution kernel followed by batch normalization and a ReLU activation function, and Φ(δ1(X)) denotes the output mapping obtained by the multi-scale cavity pyramid convolution aggregation module with δ1(X) as input (see in particular equation (2)).
Then, according to the relation between the number of output channels and the number of input channels of the stack layer, the structure of the residual network is selected in a switch-like manner, as follows:
for the first three pyramid void convolution modules, the number of input channels equals the number of output channels, and the network directly adds the output of the stack layer to its input to obtain the output, defined as:
Y=F(X)+X (4)
where Y denotes the output of the residual network.
For the fourth and fifth pyramid void convolution modules, the number of input channels does not equal the number of output channels; the network first sends the input to a 1×1 convolution kernel to change the number of channels and then adds the result to the output of the stack layer, defined as:
Y=F(X)+δ3(X) (5)
where Y denotes the output of the residual network, and δ3(X) denotes the output mapping of the input feature map through a 1×1 convolution kernel, batch normalization and a ReLU activation function.
Furthermore, the two decoders in the scale-aware crowd counting network with pyramid void convolution use different up-sampling schemes. The crowd counting decoder first adopts a 1×1 convolution kernel to output a density map, then performs three transposed convolutions, and outputs an estimated density map with the same size as the input picture. The crowd positioning decoder first adjusts the output size to 1/4 of the input image through one transposed convolution, then outputs a head pixel map and a background map through a 1×1 convolution kernel, and finally outputs a positioning map with the same size as the input image through two bilinear interpolation adjustments.
Further, the loss function of S1-4 in the crowd counting network adopts the L2 loss, which can be expressed as:
L_2 = (1/(2N)) Σ_{i=1}^{N} ‖M̂_i − M_i‖² (6)
where N is the number of pictures used for batch training, M_i is the ground-truth density map, and M̂_i is the density map estimated by the network.
Further, the loss function of S2-4 in the crowd positioning network adopts the cross-entropy loss, which can be expressed as:
L_loc = −(1/N) Σ_{j=1}^{N} Σ_{p=1}^{m×n} [γ · ψ(X_p) · log Ŷ(X_p) + (1 − ψ(X_p)) · log(1 − Ŷ(X_p))] (7)
where j refers to the j-th picture of the batch input, N is the number of pictures in the batch, p refers to the p-th pixel of each picture, m×n is the pixel size of each picture, γ is used to increase the weight at head points, Ŷ(X_p) is the prediction label generated for the p-th pixel of the j-th picture by the crowd positioning network and takes values 0 and 1 (1 indicating a human head), and ψ(X_p) is the ground-truth map of the dataset.
Meanwhile, because of the multi-scale variation within an image, a single convolution kernel cannot extract multi-scale information; the void pyramid convolution, however, extracts multi-scale information without increasing the number of parameters, improving the performance of the network.
The invention needs to estimate the total number of people in a picture, predict the position of a head point in the picture and generate a positioning picture, and the specific details are as follows:
estimating the total number of people in a picture
Given the pixel values and labels of a picture, the picture is input to the front end of the network encoder, and the feature information of the picture is extracted through the first ten layers of VGG16_bn.
The extracted feature information is sent to the first residual network with void pyramid convolution; the implementation details are as follows:
1) The extracted feature information is first sent to the first residual network with the void pyramid convolution module. The feature information is first convolved by a 1×1 convolution kernel, followed by batch normalization and a ReLU activation function.
2) The feature information is divided into 4 groups; each group of feature maps is sent to a pyramid convolution with a different void rate, the convolution kernel size being 3×3 and the void rates being d = 1, 2, 3, 4 respectively. The multi-scale context information extracted by the different groups of convolution kernels is then aggregated, and the feature information is processed by batch normalization and finally sent to a ReLU activation function, which can be expressed as:
M^1 = R_relu(F_bn(C(P^1_1, P^1_2, P^1_3, P^1_4))) (1)
where P^1_n denotes the output feature map of the n-th group of convolution kernels of the 1st multi-scale cavity pyramid convolution aggregation module, C(·) denotes aggregation of the extracted multi-scale information, F_bn denotes batch normalization, and R_relu denotes sending the feature information to the ReLU activation function.
3) The extracted multi-scale information is convolved by a 1×1 convolution kernel, followed by batch normalization and an activation function; the output mapping of the first stack layer is expressed as:
F(X)=δ2(Φ(δ1(X))) (2)
4) The input feature map is directly added to the output of the multi-scale cavity pyramid convolution aggregation module, obtaining the output mapping of the first residual network with the multi-scale cavity pyramid convolution aggregation module, which can be expressed as:
Y=F(X)+X (3)
and continuously inputting the multi-scale information extracted by the first residual error network into the second residual error network to extract the characteristic information, and repeating the steps until the fifth residual error network extracts the multi-scale characteristic information.
And finally, inputting the extracted multi-scale information into a decoder, outputting a density map by adopting a 1 × 1 convolution kernel, outputting an estimated density map with the same size as the input picture through three times of transposition convolution, and obtaining the predicted number of people through integral summation.
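The final counting step, integrating the density map, can be illustrated with a toy example; the Gaussian-blob values here are hypothetical, chosen only so that each annotated head contributes unit mass:

```python
import numpy as np

# A density map integrates (sums) to the estimated head count, because
# each annotated head contributes a unit-mass blob.
density = np.zeros((6, 6))
blob = np.array([[0.05, 0.1, 0.05],
                 [0.1,  0.4, 0.1],
                 [0.05, 0.1, 0.05]])  # sums to 1.0
# Two hypothetical "heads" placed at different locations.
density[0:3, 0:3] += blob
density[3:6, 3:6] += blob
count = round(float(density.sum()))
print(count)  # 2
```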
Second, estimating the head positions in the picture
Given the pixel values and ground-truth map of a picture, the picture is input to the network encoder and its multi-scale information is extracted; the implementation details are the same as for the encoder of the crowd counting network, and the multi-scale information is finally output.
The extracted feature information is then input to the crowd positioning decoder: its size is adjusted to 1/4 of the input image by one transposed convolution, a head pixel map and a background map are then output through a 1×1 convolution kernel, and finally a positioning map with the same size as the input image is output through two bilinear interpolation adjustments.
In summary, the scale-aware crowd counting network with pyramid hole convolution provided by the invention is used for increasing the network depth and extracting multi-scale information. The specific steps of the invention for people counting are as follows: inputting a picture, firstly extracting the feature information of the picture through the front end VGG16_ bn of the encoder, and then sequentially inputting the extracted feature map into five residual error networks with cavity pyramid convolution to extract multi-scale information. The network structure is firstly passed through a 1 x 1 convolution kernel, and is subjected to batch normalization and input to an activation function. And then dividing the extracted characteristic graphs into a plurality of groups, extracting multi-scale information by using a plurality of groups of pyramid convolutions with different voidage in a multi-scale void pyramid convolution aggregation module respectively, aggregating the characteristic information, performing batch normalization, and inputting the aggregated characteristic information into an activation function. The first three modules divide the input feature map into four groups, the void ratio D is 1, 2, 3, 4, the fourth module divides the input feature map into two groups, the void ratio D is 1, 2, and the fifth module only adopts a group of 3 × 3 convolution kernels with the void ratio of 1. After the module outputs the characteristic information, the input and the output are added through residual connection to obtain the output mapping of the residual network. And finally, inputting the extracted multi-scale information into a decoder for up-sampling, and outputting an estimated density map. The method comprises the following specific steps of: the pictures are first input to the encoder of the network, whose network structure is the same as that of the people counting. 
The decoder then first adjusts the output size to 1/4 of the input image through a transposed convolution, then outputs a head pixel map and a background map through a 1 × 1 convolution kernel, and finally outputs a localization map of the same size as the input image through two bilinear interpolation adjustments.
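As an illustration, the localization decoder described above can be sketched in PyTorch. The input width (128 channels) and the intermediate width (64) are assumptions made for this sketch; the description fixes only the scale changes (transposed convolution to 1/4 of the input, a 1 × 1 convolution producing two maps, then two bilinear ×2 interpolations back to full size):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalizationDecoder(nn.Module):
    """Sketch of the crowd localization decoder (channel widths assumed)."""

    def __init__(self, in_channels: int = 128):
        super().__init__()
        # encoder features at 1/8 scale -> 1/4 of the input image
        self.up = nn.ConvTranspose2d(in_channels, 64, kernel_size=2, stride=2)
        # 1 x 1 kernel: one head-pixel map + one background map
        self.head = nn.Conv2d(64, 2, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.head(self.up(x))
        for _ in range(2):  # two bilinear x2 steps: 1/4 -> 1/2 -> full size
            x = F.interpolate(x, scale_factor=2, mode="bilinear",
                              align_corners=False)
        return x
```

A feature map at 1/8 of the input resolution thus comes out at full resolution with two channels, matching the head-pixel map and background map described above.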
The method provided by the invention effectively extracts multi-scale information through several groups of pyramid convolution kernels with different dilation rates, addressing the problem of non-uniform head sizes. Adding batch normalization to the output of each layer mitigates the training difficulty caused by increased network depth, while the residual structure further deepens the network without increasing the number of parameters, yielding stronger robustness.
The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A training method for a crowd counting network, comprising a step one of performing crowd counting, which comprises the following steps:
step S1-1, the front end of the encoder of the network adopts the first ten layers of VGG16_bn; a sample picture is input to the front end of the encoder, and the feature information of the picture is extracted;
step S1-2, the feature information of the picture extracted by the front end of the encoder is sent to the back end of the encoder, which adopts five residual network structures with multi-scale dilated pyramid convolution aggregation modules to increase the depth of the network and extract multi-scale feature information;
step S1-3, the extracted multi-scale feature information is sent to a crowd counting decoder for three upsampling operations, and finally a one-channel estimated density map is output;
and step S1-4, a first loss function is calculated from the one-channel estimated density map, the network is optimized according to the first loss function, and the evaluation index of the crowd count is calculated.
2. The training method for a crowd counting network according to claim 1, further comprising, before step S1-3, generating the crowd counting decoder, wherein:
the crowd counting decoder first outputs a density map through a 1 × 1 convolution kernel, then performs three transposed convolutions, and outputs an estimated density map of the same size as the input picture.
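A minimal PyTorch sketch of this counting decoder, assuming the encoder output has 128 channels at 1/8 of the input resolution (widths and kernel sizes are not fixed by the claim and are chosen here for illustration):

```python
import torch
import torch.nn as nn

class CountingDecoder(nn.Module):
    """Sketch of the counting decoder of claim 2 (parameters assumed):
    a 1 x 1 convolution reduces the features to a one-channel density
    map, then three stride-2 transposed convolutions upsample it back
    to the input resolution (1/8 -> 1/4 -> 1/2 -> 1/1)."""

    def __init__(self, in_channels: int = 128):
        super().__init__()
        self.to_density = nn.Conv2d(in_channels, 1, kernel_size=1)
        self.upsample = nn.Sequential(
            nn.ConvTranspose2d(1, 1, kernel_size=2, stride=2),
            nn.ConvTranspose2d(1, 1, kernel_size=2, stride=2),
            nn.ConvTranspose2d(1, 1, kernel_size=2, stride=2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.upsample(self.to_density(x))
```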
3. The training method for a crowd counting network according to claim 1, wherein the first loss function in step S1-4 is the L2 loss, expressed as:
L2 = (1/(2N)) Σ_{i=1}^{N} ||M̂_i − M_i||₂²,
where N refers to the number of pictures used for batch training, M_i is the ground-truth density map, and M̂_i is the estimated density map of the network.
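The L2 loss of claim 3 can be written directly, for example with NumPy (array layout `(N, H, W)` is an assumption of this sketch):

```python
import numpy as np

def counting_loss(est_density: np.ndarray, true_density: np.ndarray) -> float:
    """L2 loss of claim 3: L2 = (1/(2N)) * sum_i ||M_hat_i - M_i||^2,
    where N is the batch size; arrays have shape (N, H, W)."""
    n = est_density.shape[0]
    diff = est_density - true_density
    return float(np.sum(diff ** 2) / (2 * n))
```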
4. The training method for a crowd counting network according to claim 1, further comprising a step two of performing crowd localization, which comprises the following steps:
step S2-1, the front end of the encoder of the network adopts the first ten layers of VGG16_bn; a sample picture is input to the front end, and the feature information of the picture is extracted;
step S2-2, the extracted feature information of the picture is sent to the back end of the encoder, which adopts five residual network structures with multi-scale dilated pyramid convolution aggregation modules to increase the depth of the network and extract multi-scale feature information;
step S2-3, the extracted multi-scale feature information is sent to a crowd localization decoder, which outputs a head pixel map and a background map with two channels;
and step S2-4, a second loss function is calculated from the head pixel map and the background map, the network is optimized according to the second loss function, and the evaluation index of crowd localization is calculated.
5. The training method for a crowd counting network according to claim 4, wherein the second loss function in step S2-4 is a cross-entropy loss, expressed as:
L = −(1/N) Σ_{j=1}^{N} Σ_{p=1}^{m×n} [γ·Ψ(X_p)·log Y(X_p) + (1 − Ψ(X_p))·log(1 − Y(X_p))],
where j refers to the j-th picture of the batch input, N refers to the number of pictures in the batch, p refers to the p-th pixel of each picture, m × n is the pixel size of each picture, γ is used to increase the weight at the head points, Y(X_p) is the prediction label generated for the p-th pixel of the j-th picture by the crowd localization network, taking values 0 and 1, and Ψ(X_p) refers to the ground-truth map of the data set.
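Under the usual weighted binary cross-entropy reading of this claim (the claim's equation image is not reproduced in the text, so the exact form below, and the default γ, are assumptions of this sketch), the loss can be written as:

```python
import numpy as np

def localization_loss(pred: np.ndarray, truth: np.ndarray,
                      gamma: float = 2.0, eps: float = 1e-7) -> float:
    """Weighted pixel-wise cross-entropy in the spirit of claim 5
    (form and default gamma are assumptions).  `pred` holds Y(X_p),
    the per-pixel head prediction of the localization network; `truth`
    holds the ground-truth map Psi(X_p) in {0, 1}; gamma raises the
    weight of the head pixels.  Shapes: (N, m, n)."""
    n = pred.shape[0]
    p = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    ce = -(gamma * truth * np.log(p) + (1.0 - truth) * np.log(1.0 - p))
    return float(ce.sum() / n)
```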
6. The training method for a crowd counting network according to claim 4, further comprising, before step S2-3, generating the crowd localization decoder, wherein:
the crowd localization decoder first adjusts the output size to 1/4 of the input image through a transposed convolution, then outputs a head pixel map and a background map through a 1 × 1 convolution kernel, and finally outputs a localization map of the same size as the input image through two bilinear interpolation adjustments.
7. The training method for a crowd counting network according to claim 1 or 4, wherein, in step S1-2 or S2-2, sending the extracted feature information of the picture to the back end of the encoder, which adopts five residual network structures with multi-scale dilated pyramid convolution aggregation modules to increase the depth of the network and extract multi-scale feature information, comprises:
firstly, the feature information of the picture extracted in the previous step is input to the multi-scale dilated pyramid convolution aggregation module, which first groups the feature maps; each group extracts multi-scale information with 3 × 3 convolution kernels of a different dilation rate, the number of channels in each group being 128; dilated convolution enlarges the receptive field without increasing the number of parameters, and a convolution kernel with a given dilation rate is equivalent to a new, larger convolution kernel, whose size is expressed as:
K = k + (k − 1) × (d − 1),
where K is the equivalent convolution kernel size, k is the dilated convolution kernel size, and d is the dilation rate of the kernel; by this calculation, a 3 × 3 convolution with d = 2 corresponds to a 5 × 5 convolution kernel, a 3 × 3 convolution kernel with d = 3 to a 7 × 7 convolution kernel, and a 3 × 3 convolution kernel with d = 4 to a 9 × 9 convolution kernel;
then the multi-scale context information extracted by the different groups of convolution kernels is aggregated, processed through batch normalization, and finally sent to a ReLU activation function, expressed as:
F_i = R_relu(F_bn(C(P_i^1, …, P_i^n))),
where P_i^n represents the output feature map of the n-th group of convolution kernels of the i-th multi-scale dilated pyramid convolution aggregation module, C(·) represents the aggregation of the extracted multi-scale information, F_bn denotes batch normalization, and R_relu denotes sending the feature information to the ReLU activation function.
8. The training method for a crowd counting network according to claim 1 or 4, wherein, among the five multi-scale dilated pyramid convolution aggregation modules of step S1-2 or S2-2,
the first three modules divide the feature information of the input picture into four groups and convolve them with 3 × 3 convolution kernels of dilation rates D = 1, 2, 3, 4 respectively, with 512 output channels;
the fourth module divides the input feature map into two groups with dilation rates D = 1, 2, with 256 output channels;
and the fifth module adopts only a single group of 3 × 3 convolution kernels with dilation rate 1, with 128 output channels.
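The grouped dilated convolutions of claims 7 and 8 can be sketched as a PyTorch module. The four-group, 128-channels-per-group configuration of the first three modules is shown; setting the padding equal to the dilation rate (an implementation choice not stated in the claims) keeps the spatial size unchanged:

```python
import torch
import torch.nn as nn

class DilatedPyramidAggregation(nn.Module):
    """Sketch of a multi-scale dilated pyramid convolution aggregation
    module: the input feature map is split into groups, each group is
    filtered by a 3 x 3 convolution with its own dilation rate, and the
    group outputs are concatenated, batch-normalized and passed through
    ReLU (128 channels per group, as in the description)."""

    def __init__(self, dilations=(1, 2, 3, 4), group_channels: int = 128):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(group_channels, group_channels, kernel_size=3,
                      padding=d, dilation=d)
            for d in dilations
        )
        self.bn = nn.BatchNorm2d(group_channels * len(dilations))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        groups = torch.chunk(x, len(self.branches), dim=1)
        out = torch.cat([b(g) for b, g in zip(self.branches, groups)], dim=1)
        return self.relu(self.bn(out))
```

Instantiating with `dilations=(1, 2)` and `dilations=(1,)` would give the fourth and fifth modules' two-group and one-group variants, respectively.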
9. The training method for a crowd counting network according to claim 1 or 4, further comprising, before step S1-2 or S2-2, generating the residual network structure with the multi-scale dilated pyramid convolution aggregation module, comprising:
firstly, to increase the network depth, adding a 1 × 1 convolution layer at each end of the dilated pyramid convolution module, each followed by batch normalization and a ReLU activation function; naming this structure the stacked layer, whose output mapping is expressed as:
F(X) = δ2(Φ(δ1(X))),
where X represents the output of the previous layer, serving as the input of the residual network, δi(·) represents the output mapping of its input through a 1 × 1 convolution kernel, batch normalization and the ReLU activation function, and Φ(δ1(X)) represents taking δ1(X) as input and obtaining an output mapping through the multi-scale dilated pyramid convolution aggregation module;
and then selecting the structure of the residual network, in a switch-like manner, according to the relation between the numbers of output and input channels of the stacked layer.
10. The training method for a crowd counting network according to claim 9, wherein selecting the structure of the residual network in a switch-like manner according to the relation between the numbers of output and input channels of the stacked layer comprises:
for the first three multi-scale dilated pyramid convolution aggregation modules, the number of input channels equals the number of output channels, and the network directly adds the output of the stacked layer to its input to obtain the output, defined as:
Y = F(X) + X,
where Y represents the output of the residual network;
for the fourth and fifth multi-scale dilated pyramid convolution aggregation modules, the number of input channels does not equal the number of output channels, so the network first sends the input through a 1 × 1 convolution kernel to change the number of channels and then adds it to the output of the stacked layer, defined as:
Y = F(X) + δ3(X),
where Y represents the output of the residual network and δ3(X) represents the output mapping of the input feature map through a 1 × 1 convolution kernel, batch normalization and the ReLU activation function.
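The switch-like choice of shortcut in claims 9 and 10 can be sketched as follows. For brevity, the stacked layer F(X) is stood in for by a single 3 × 3 convolution rather than the full 1 × 1 / pyramid / 1 × 1 stack, so this is an illustration of the shortcut selection only:

```python
import torch
import torch.nn as nn

class ResidualStack(nn.Module):
    """Sketch of claim 10's channel-count switch: when the stacked
    layer preserves the channel count, the plain input is added
    (Y = F(X) + X); when it changes the channel count, a 1 x 1
    convolution (delta_3) adapts the shortcut first."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # stand-in for the 1x1 / pyramid / 1x1 stacked layer F(X)
        self.stack = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        if in_channels == out_channels:
            self.shortcut = nn.Identity()                      # Y = F(X) + X
        else:
            self.shortcut = nn.Conv2d(in_channels, out_channels, 1)  # delta_3

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.stack(x) + self.shortcut(x)
```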
CN202111449140.3A 2021-11-29 2021-11-29 Training method of crowd counting network Active CN114154620B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111449140.3A CN114154620B (en) 2021-11-29 2021-11-29 Training method of crowd counting network

Publications (2)

Publication Number Publication Date
CN114154620A true CN114154620A (en) 2022-03-08
CN114154620B CN114154620B (en) 2024-05-21

Family

ID=80455410

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111449140.3A Active CN114154620B (en) 2021-11-29 2021-11-29 Training method of crowd counting network

Country Status (1)

Country Link
CN (1) CN114154620B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200118423A1 (en) * 2017-04-05 2020-04-16 Carnegie Mellon University Deep Learning Methods For Estimating Density and/or Flow of Objects, and Related Methods and Software
CN111242036A (en) * 2020-01-14 2020-06-05 西安建筑科技大学 Crowd counting method based on encoding-decoding structure multi-scale convolutional neural network
WO2020156028A1 (en) * 2019-01-28 2020-08-06 南京航空航天大学 Outdoor non-fixed scene weather identification method based on deep learning
CN111611878A (en) * 2020-04-30 2020-09-01 杭州电子科技大学 Method for crowd counting and future people flow prediction based on video image

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
严芳芳; 吴秦: "Crowd counting algorithm using multi-channel fusion grouped convolutional neural networks", 小型微型计算机系统 (Journal of Chinese Computer Systems), no. 10
孟月波; 纪拓; 刘光辉; 徐胜军; 李彤月: "Encoder-decoder multi-scale convolutional neural network method for crowd counting", 西安交通大学学报 (Journal of Xi'an Jiaotong University), no. 05

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115775227A (en) * 2022-10-12 2023-03-10 浙江吉昌新材料有限公司 Intelligent production method of anti-cracking sagger and control system thereof
CN115763167A (en) * 2022-11-22 2023-03-07 黄华集团有限公司 Solid cabinet breaker and control method thereof
CN115763167B (en) * 2022-11-22 2023-09-22 黄华集团有限公司 Solid cabinet circuit breaker and control method thereof


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant