CN114154620A - Training method of crowd counting network - Google Patents

Training method of crowd counting network

Info

Publication number
CN114154620A
CN114154620A
Authority
CN
China
Prior art keywords
convolution
network
scale
output
input
Prior art date
Legal status
Granted
Application number
CN202111449140.3A
Other languages
Chinese (zh)
Other versions
CN114154620B (en)
Inventor
赵怀林
梁兰军
周方波
Current Assignee
Shanghai Institute of Technology
Original Assignee
Shanghai Institute of Technology
Priority date
Filing date
Publication date
Application filed by Shanghai Institute of Technology filed Critical Shanghai Institute of Technology
Priority to CN202111449140.3A
Publication of CN114154620A
Application granted
Publication of CN114154620B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a training method for a crowd counting network. The method effectively extracts multi-scale information through several groups of pyramid convolution kernels with different void rates, addressing the problem of non-uniform head sizes. Adding batch normalization to each layer's output eases the training difficulty caused by increased network depth, while a residual structure further deepens the network without increasing the parameter count, yielding higher robustness.

Description

Training method of crowd counting network
Technical Field
The invention relates to a training method of a crowd counting network.
Background
People counting and people positioning are important tasks in current computer vision. In practice, however, variable shooting angles make the head sizes in a picture non-uniform, and high-density scenes exhibit severe occlusion and uneven crowd distribution, all of which increase the difficulty of crowd counting and crowd positioning. Convolutional neural networks provide a better approach to both tasks, and the network is usually desired to be as deep as possible to better map the input-output relationship; but as the depth increases, the number of parameters grows, making the network difficult to train and even causing gradients to explode or vanish.
Disclosure of Invention
The invention aims to provide a training method of a crowd counting network.
In order to solve the above problems, the present invention provides a training method for a crowd counting network, comprising:
Step 1, when crowd counting is performed, the method comprises the following steps:
Step S1-1, the front end of the network's encoder adopts the first ten layers of VGG16_bn; the sample pictures are input to the encoder front end, and the feature information of the pictures is extracted;
Step S1-2, the picture feature information extracted at the encoder front end is sent to the encoder back end, which adopts five residual network structures with multi-scale cavity pyramid convolution aggregation modules to increase the depth of the network and extract multi-scale feature information;
Step S1-3, the extracted multi-scale feature information is sent to a crowd counting decoder for three rounds of up-sampling, finally outputting a single-channel estimated density map;
Step S1-4, a first loss function is calculated from the single-channel estimated density map, the network is optimized according to the first loss function, and the evaluation index of crowd counting is calculated.
Further, in the above method, before step S1-3, the method further comprises generating the crowd counting decoder, including:
the crowd counting decoder first adopts a 1×1 convolution kernel to output a density map, then performs three transposed convolutions, and outputs an estimated density map with the same size as the input picture.
Further, in the above method, the first loss function in step S1-4 adopts the L2 loss, expressed as:
L_2 = (1/(2N)) Σ_{i=1}^{N} ‖M̂_i − M_i‖²,
where N is the number of pictures used for batch training, M_i is the ground-truth density map, and M̂_i is the density map estimated by the network.
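This loss can be sketched numerically. A minimal NumPy version, assuming the common 1/(2N) normalization (the patent's exact formula appears only as an image in the original document):

```python
import numpy as np

def l2_loss(est_maps, gt_maps):
    """L2 loss between estimated and ground-truth density maps.

    est_maps, gt_maps: arrays of shape (N, H, W); N is the batch size.
    """
    n = est_maps.shape[0]
    diff = est_maps - gt_maps
    # (1 / 2N) * sum of squared differences over all pixels of the batch
    return np.sum(diff ** 2) / (2 * n)

# Toy batch of two 4x4 density maps, off by 0.1 everywhere.
gt = np.zeros((2, 4, 4))
gt[0, 1, 1] = 1.0
est = gt + 0.1
loss = l2_loss(est, gt)
print(loss)  # 0.08 = (32 pixels * 0.01) / (2 * 2)
```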
Further, the method further comprises a step two; when crowd positioning is performed, the method comprises the following steps:
Step S2-1, the front end of the network's encoder adopts the first ten layers of VGG16_bn; a sample picture is input to the front end, and the feature information of the picture is extracted;
Step S2-2, the extracted picture feature information is sent to the encoder back end, which adopts five residual network structures with multi-scale cavity pyramid convolution aggregation modules to increase the depth of the network and extract multi-scale feature information;
Step S2-3, the extracted multi-scale feature information is sent to a crowd positioning decoder, which outputs a head pixel map and a background map, the number of channels being 2;
Step S2-4, a second loss function is calculated from the head pixel map and the background map, the network is optimized according to the second loss function, and the evaluation index of crowd positioning is calculated.
Further, in the above method, the second loss function in step S2-4 adopts the cross-entropy loss, expressed as:
L_loc = −(1/N) Σ_{j=1}^{N} Σ_{p=1}^{m×n} [γ · ψ(X_p) · log Ŷ(X_p) + (1 − ψ(X_p)) · log(1 − Ŷ(X_p))],
where j refers to the j-th picture of the batch input, N is the number of pictures in the batch, p refers to the p-th pixel of each picture, m×n is the pixel size of each picture, γ is used to increase the weight at head points, Ŷ(X_p) is the prediction label generated for the p-th pixel of the j-th picture by the crowd positioning network and takes values 0 and 1, and ψ(X_p) is the ground-truth map of the dataset.
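One plausible numerical reading of this loss is sketched below in NumPy. The exact role of γ and the precise formula are assumptions, since the patent's equation appears only as an image; here γ simply up-weights the head-point term of a standard pixel-wise cross-entropy.

```python
import numpy as np

def localization_loss(pred, gt, gamma=2.0, eps=1e-7):
    """Weighted pixel-wise cross-entropy for crowd localization (sketch).

    pred: predicted head probabilities, shape (N, m, n), values in (0, 1).
    gt:   ground-truth head map psi, shape (N, m, n), values 0 or 1.
    gamma: assumed extra weight on head-point pixels.
    """
    n = pred.shape[0]
    pred = np.clip(pred, eps, 1 - eps)  # numerical safety for log()
    head_term = gamma * gt * np.log(pred)
    bg_term = (1 - gt) * np.log(1 - pred)
    return -np.sum(head_term + bg_term) / n

# Toy 2x2 picture with one head point at (0, 0).
gt = np.zeros((1, 2, 2))
gt[0, 0, 0] = 1.0
pred = np.full((1, 2, 2), 0.1)
pred[0, 0, 0] = 0.9
loss = localization_loss(pred, gt)
```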
Further, in the above method, before step S2-3, the method further comprises generating the crowd positioning decoder, including:
the crowd positioning decoder first adjusts the output size to 1/4 of the input image through one transposed convolution, then outputs a head pixel map and a background map through a 1×1 convolution kernel, and finally outputs a positioning map with the same size as the input image through two bilinear interpolation adjustments.
Further, in the above method, in step S1-2 or S2-2, sending the extracted picture feature information to the encoder back end, which adopts five residual network structures with multi-scale cavity pyramid convolution aggregation modules to increase the depth of the network and extract multi-scale feature information, includes:
first, the picture feature information extracted in the previous step is input to a multi-scale cavity pyramid convolution aggregation module; the module first groups the feature maps, and each group adopts 3×3 convolution kernels with different void rates to extract multi-scale information, the number of channels in each group being 128. The void convolution enlarges the receptive field without increasing the number of parameters; a convolution kernel with a given void rate is equivalent to a new, larger convolution kernel, and the equivalent kernel size is expressed as:
K=k+(k-1)×(d-1),
where K is the equivalent convolution kernel size, k is the void convolution kernel size, and d is the void rate of the void convolution kernel; by calculation, a 3×3 convolution with d = 2 corresponds to a 5×5 convolution kernel, a 3×3 convolution kernel with d = 3 corresponds to a 7×7 convolution kernel, and a 3×3 convolution kernel with d = 4 corresponds to a 9×9 convolution kernel;
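The equivalent-kernel relation is easy to verify directly; the three worked cases above follow from K = k + (k − 1)(d − 1):

```python
def equivalent_kernel(k, d):
    """Equivalent kernel size of a k x k convolution with void rate d:
    K = k + (k - 1) * (d - 1).
    """
    return k + (k - 1) * (d - 1)

# The three cases used by the module's pyramid groups:
assert equivalent_kernel(3, 2) == 5
assert equivalent_kernel(3, 3) == 7
assert equivalent_kernel(3, 4) == 9
```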
then the multi-scale context information extracted by the different groups of convolution kernels is aggregated, and the feature information is processed by batch normalization and finally sent to a ReLU activation function, expressed as:
M^i = R_relu(F_bn(C(P^i_1, P^i_2, …, P^i_n))),
where P^i_n denotes the output feature map of the n-th group of convolution kernels of the i-th multi-scale cavity pyramid convolution aggregation module, C(·) denotes aggregation of the extracted multi-scale information, F_bn denotes batch normalization, and R_relu denotes sending the feature information to the ReLU activation function.
Further, in the above method, among the five multi-scale cavity pyramid convolution aggregation modules of step S1-2 or S2-2,
the first three modules divide the feature information of the input picture into four groups and convolve them with 3×3 convolution kernels with void rates d = 1, 2, 3, 4 respectively, the number of output channels being 512;
the fourth module divides the input feature map into two groups with void rates d = 1, 2, the number of output channels being 256;
the fifth module adopts only one group of 3×3 convolution kernels with void rate 1, the number of output channels being 128.
Further, in the above method, before step S1-2 or S2-2, generating the residual network structure with the multi-scale cavity pyramid convolution aggregation module includes:
first, in order to increase the network depth, a 1×1 convolution layer is added at each end of the multi-scale cavity pyramid convolution module, followed by batch normalization and input to a ReLU activation function; naming this structure a stack layer, the output mapping of the stack layer is expressed as:
F(X)=δ2(Φ(δ1(X))),
where X, the output of the previous layer, is the input of the residual network, δi(·) denotes the output mapping of a 1×1 convolution kernel followed by batch normalization and a ReLU activation function, and Φ(δ1(X)) denotes the output mapping obtained by the multi-scale cavity pyramid convolution aggregation module with δ1(X) as input;
then the structure of the residual network is selected, in a switch-like manner, according to the relation between the number of output channels and the number of input channels of the stack layer.
Further, in the above method, selecting the structure of the residual network in a switch-like manner according to the relation between the number of output channels and the number of input channels of the stack layer includes:
for the first three multi-scale cavity pyramid convolution aggregation modules, the number of input channels equals the number of output channels, and the network directly adds the output of the stack layer to its input to obtain the output, defined as:
Y=F(X)+X,
where Y denotes the output of the residual network;
for the fourth and fifth multi-scale cavity pyramid convolution aggregation modules, the number of input channels does not equal the number of output channels; the network first sends the input to a 1×1 convolution kernel to change the number of channels and then adds the result to the output of the stack layer, defined as:
Y=F(X)+δ3(X),
where Y denotes the output of the residual network, and δ3(X) denotes the output mapping of the input feature map through a 1×1 convolution kernel, batch normalization and a ReLU activation function.
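The switch-like choice between the two residual forms can be sketched as follows. This is a NumPy illustration in which the 1×1 convolution δ3 is modeled as a per-pixel channel-mixing matrix (a simplification; the patent's δ3 also includes batch normalization and ReLU):

```python
import numpy as np

def residual_output(f_x, x, project=None):
    """Switch-like residual connection.

    f_x: stack-layer output F(X), shape (C_out, H, W).
    x:   stack-layer input X, shape (C_in, H, W).
    project: optional callable standing in for delta_3 (a 1x1
             convolution in the patent, modeled here as a matrix).
    """
    if f_x.shape == x.shape:   # C_in == C_out: Y = F(X) + X
        return f_x + x
    return f_x + project(x)    # C_in != C_out: Y = F(X) + delta_3(X)

# A "1x1 convolution" mapping 4 channels to 2 is just a (2, 4) matrix
# applied independently at every pixel.
w = np.ones((2, 4)) / 4.0
proj = lambda x: np.einsum("oc,chw->ohw", w, x)

x = np.ones((4, 3, 3))
f_same = np.zeros((4, 3, 3))
f_diff = np.zeros((2, 3, 3))
assert residual_output(f_same, x).shape == (4, 3, 3)
assert residual_output(f_diff, x, proj).shape == (2, 3, 3)
```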
Compared with the prior art, the scale-aware crowd counting network with pyramid void convolution provided by the invention increases the network depth and extracts multi-scale information. The specific steps for crowd counting are as follows: a picture is input; the feature information of the picture is first extracted by the encoder front end VGG16_bn, and the extracted feature map is then input in sequence to five residual networks with cavity pyramid convolution to extract multi-scale information. The network structure first passes through a 1×1 convolution kernel, with batch normalization and input to an activation function. The extracted feature maps are then divided into several groups; in the multi-scale cavity pyramid convolution aggregation module, several groups of pyramid convolutions with different void rates extract multi-scale information respectively, and the feature information is aggregated, batch-normalized and input to an activation function. The first three modules divide the input feature map into four groups with void rates d = 1, 2, 3, 4; the fourth module divides the input feature map into two groups with void rates d = 1, 2; and the fifth module adopts only one group of 3×3 convolution kernels with void rate 1. After a module outputs its feature information, the input and output are added through a residual connection to obtain the output mapping of the residual network. Finally, the extracted multi-scale information is input to a decoder for up-sampling, and an estimated density map is output.
The specific steps for crowd positioning are as follows: the picture is first input to the network's encoder, whose structure is the same as that for crowd counting. The decoder then adjusts the output size to 1/4 of the input image through one transposed convolution, outputs a head pixel map and a background map through a 1×1 convolution kernel, and finally outputs a positioning map with the same size as the input image through two bilinear interpolation adjustments.
The method provided by the invention effectively extracts multi-scale information through several groups of pyramid convolution kernels with different void rates, solving the problem of non-uniform head sizes. Adding batch normalization to each layer's output eases the training difficulty caused by increased network depth, while a residual structure further deepens the network without increasing the parameter count, yielding higher robustness.
Drawings
FIG. 1 is a network architecture diagram of population count according to one embodiment of the present invention;
FIG. 2 is a diagram of a network architecture for crowd positioning according to an embodiment of the present invention;
FIG. 3 is a block diagram of a multi-scale void pyramid convolution aggregation module according to an embodiment of the present invention;
FIG. 4 is a residual network structure of the multi-scale cavity pyramid convolution aggregation module according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
In the face of the problems of low network depth and multiple scale changes in the prior art, the invention aims to design a method capable of extracting multi-scale information and increasing the network depth.
The invention provides a training method for a crowd counting network, comprising: S1, when crowd counting is performed, with the network framework shown in FIG. 1, the following steps:
S1-1, the front end of the network's encoder adopts the first ten layers of VGG16_bn; the sample pictures are input to the encoder front end, and the feature information of the pictures is extracted.
S1-2, the feature map extracted at the front end is sent to the encoder back end, where five residual network structures with multi-scale cavity pyramid convolution aggregation modules are adopted to increase the depth of the network and extract multi-scale information;
S1-3, the extracted multi-scale feature information is sent to a decoder for three rounds of up-sampling, finally outputting a single-channel estimated density map;
S1-4, a loss function is calculated from the estimated density map, and the network is optimized.
S2, when crowd positioning is performed, with the network framework shown in FIG. 2, the following steps:
S2-1, the sample picture is input to the front end of an encoder with the same structure as for crowd counting, and the feature information of the picture is extracted;
S2-2, the extracted feature map is sent to the encoder back end, and multi-scale feature information is extracted;
S2-3, the extracted feature information is sent to a decoder, which outputs a head pixel map and a background map, the number of channels being 2;
S2-4, the network is optimized according to the loss function.
The multi-scale cavity pyramid convolution aggregation module in steps S1-2 and S2-2 of crowd counting and crowd positioning operates as follows:
first, the feature map extracted in the previous step is input to the multi-scale cavity convolution aggregation module; as shown in FIG. 3, C denotes aggregation, BR denotes batch normalization and ReLU, and MDC denotes the multi-scale cavity pyramid convolution aggregation module. The module first groups the feature maps; each group adopts 3×3 convolution kernels with different void rates to extract multi-scale information, and the number of channels in each group is 128. The void convolution can enlarge the receptive field without increasing the number of parameters; convolution kernels with different void rates are equivalent to new, larger convolution kernels, and the equivalent kernel size can be expressed as:
K=k+(k-1)×(d-1) (1)
where K is the equivalent convolution kernel size, k is the void convolution kernel size, and d is the void rate of the void convolution kernel. By calculation, a 3×3 convolution with d = 2 corresponds to a 5×5 convolution kernel, a 3×3 convolution kernel with d = 3 corresponds to a 7×7 convolution kernel, and a 3×3 convolution kernel with d = 4 corresponds to a 9×9 convolution kernel.
The multi-scale context information extracted by the different groups of convolution kernels is then aggregated, and the feature information is processed by batch normalization and finally sent to a ReLU activation function, which can be expressed as:
M^i = R_relu(F_bn(C(P^i_1, P^i_2, …, P^i_n))) (2)
where P^i_n denotes the output feature map of the n-th group of convolution kernels of the i-th multi-scale cavity pyramid convolution aggregation module, C(·) denotes aggregation of the extracted multi-scale information, F_bn denotes batch normalization, and R_relu denotes sending the feature information to the ReLU activation function.
The scale-aware crowd counting network with pyramid void convolution adopts five multi-scale cavity pyramid convolution aggregation modules in total. The first three modules divide the input feature map into four groups, convolved with 3×3 convolution kernels with void rates d = 1, 2, 3, 4 respectively; the number of output channels is 512. The fourth module divides the input feature map into two groups with void rates d = 1, 2; the number of output channels is 256. The fifth module adopts only one group of 3×3 convolution kernels with void rate 1; the number of output channels is 128.
Further, the residual network structure with the multi-scale cavity pyramid convolution aggregation module in steps S1-2 and S2-2, shown in FIG. 4, operates as follows:
first, in order to increase the network depth, a 1×1 convolution layer is added at each end of the multi-scale cavity pyramid convolution module, followed by batch normalization and input to a ReLU activation function. The invention names this structure a stack layer, and the output mapping of the stack layer can be expressed as:
F(X)=δ2(Φ(δ1(X))) (3)
where X, the output of the previous layer, is the input of the residual network, δi(·) denotes the output mapping of a 1×1 convolution kernel followed by batch normalization and a ReLU activation function, and Φ(δ1(X)) denotes the output mapping obtained by the multi-scale cavity pyramid convolution aggregation module with δ1(X) as input (see in particular equation (2)).
Then, according to the relation between the number of output channels and the number of input channels of the stack layer, the structure of the residual network is selected in a switch-like manner, as follows:
for the first three pyramid void convolution modules, the number of input channels equals the number of output channels, and the network directly adds the output of the stack layer to its input to obtain the output, defined as:
Y=F(X)+X (4)
where Y denotes the output of the residual network.
For the fourth and fifth pyramid void convolution modules, the number of input channels does not equal the number of output channels; the network first sends the input to a 1×1 convolution kernel to change the number of channels and then adds the result to the output of the stack layer, defined as:
Y=F(X)+δ3(X) (5)
where Y denotes the output of the residual network, and δ3(X) denotes the output mapping of the input feature map through a 1×1 convolution kernel, batch normalization and a ReLU activation function.
Furthermore, the two decoders in the scale-aware crowd counting network with pyramid void convolution use different up-sampling schemes. The crowd counting decoder first adopts a 1×1 convolution kernel to output a density map, then performs three transposed convolutions, and outputs an estimated density map with the same size as the input picture. The crowd positioning decoder first adjusts the output size to 1/4 of the input image through one transposed convolution, then outputs a head pixel map and a background map through a 1×1 convolution kernel, and finally outputs a positioning map with the same size as the input image through two bilinear interpolation adjustments.
Further, the loss function of S1-4 in the crowd counting network adopts the L2 loss, which can be expressed as:
L_2 = (1/(2N)) Σ_{i=1}^{N} ‖M̂_i − M_i‖² (6)
where N is the number of pictures used for batch training, M_i is the ground-truth density map, and M̂_i is the density map estimated by the network.
Further, the loss function of S2-4 in the crowd positioning network adopts the cross-entropy loss, which can be expressed as:
L_loc = −(1/N) Σ_{j=1}^{N} Σ_{p=1}^{m×n} [γ · ψ(X_p) · log Ŷ(X_p) + (1 − ψ(X_p)) · log(1 − Ŷ(X_p))] (7)
where j refers to the j-th picture of the batch input, N is the number of pictures in the batch, p refers to the p-th pixel of each picture, m×n is the pixel size of each picture, γ is used to increase the weight at head points, Ŷ(X_p) is the prediction label generated for the p-th pixel of the j-th picture by the crowd positioning network and takes values 0 and 1 (1 indicating a human head), and ψ(X_p) is the ground-truth map of the dataset.
Meanwhile, because of the multi-scale variation within an image, a single convolution kernel cannot extract multi-scale information; the void pyramid convolution, however, extracts multi-scale information without increasing the number of parameters, improving the performance of the network.
The invention needs to estimate the total number of people in a picture, predict the position of a head point in the picture and generate a positioning picture, and the specific details are as follows:
estimating the total number of people in a picture
Given the pixel values and labels of a picture, the picture is input to the front end of the network encoder, and the feature information of the picture is extracted through the first ten layers of VGG16_bn.
The extracted feature information is sent to the first residual network with void pyramid convolution; the implementation details are as follows:
1) The extracted feature information is first sent to the first residual network with the void pyramid convolution module. The feature information is first convolved by a 1×1 convolution kernel, followed by batch normalization and a ReLU activation function.
2) The feature information is divided into 4 groups; each group of feature maps is sent to a pyramid convolution with a different void rate, the convolution kernel size being 3×3 and the void rates being d = 1, 2, 3, 4 respectively. The multi-scale context information extracted by the different groups of convolution kernels is then aggregated, and the feature information is processed by batch normalization and finally sent to a ReLU activation function, which can be expressed as:
M^1 = R_relu(F_bn(C(P^1_1, P^1_2, P^1_3, P^1_4))) (1)
where P^1_n denotes the output feature map of the n-th group of convolution kernels of the 1st multi-scale cavity pyramid convolution aggregation module, C(·) denotes aggregation of the extracted multi-scale information, F_bn denotes batch normalization, and R_relu denotes sending the feature information to the ReLU activation function.
3) The extracted multi-scale information is convolved by a 1×1 convolution kernel, followed by batch normalization and an activation function; the output mapping of the first stack layer is expressed as:
F(X)=δ2(Φ(δ1(X))) (2)
4) The input feature map is directly added to the output of the multi-scale cavity pyramid convolution aggregation module, obtaining the output mapping of the first residual network with the multi-scale cavity pyramid convolution aggregation module, which can be expressed as:
Y=F(X)+X (3)
and continuously inputting the multi-scale information extracted by the first residual error network into the second residual error network to extract the characteristic information, and repeating the steps until the fifth residual error network extracts the multi-scale characteristic information.
And finally, inputting the extracted multi-scale information into a decoder, outputting a density map by adopting a 1 × 1 convolution kernel, outputting an estimated density map with the same size as the input picture through three times of transposition convolution, and obtaining the predicted number of people through integral summation.
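The final counting step, integrating the density map, can be illustrated with a toy example; the Gaussian-blob values here are hypothetical, chosen only so that each annotated head contributes unit mass:

```python
import numpy as np

# A density map integrates (sums) to the estimated head count, because
# each annotated head contributes a unit-mass blob.
density = np.zeros((6, 6))
blob = np.array([[0.05, 0.1, 0.05],
                 [0.1,  0.4, 0.1],
                 [0.05, 0.1, 0.05]])  # sums to 1.0
# Two hypothetical "heads" placed at different locations.
density[0:3, 0:3] += blob
density[3:6, 3:6] += blob
count = round(float(density.sum()))
print(count)  # 2
```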
Second, estimating the head positions in the picture
Given the pixel values and ground-truth map of a picture, the picture is input to the network encoder and its multi-scale information is extracted; the implementation details are the same as for the encoder of the crowd counting network, and the multi-scale information is finally output.
The extracted feature information is then input to the crowd positioning decoder: its size is adjusted to 1/4 of the input image by one transposed convolution, a head pixel map and a background map are then output through a 1×1 convolution kernel, and finally a positioning map with the same size as the input image is output through two bilinear interpolation adjustments.
In summary, the scale-aware crowd counting network with pyramid hole convolution provided by the invention is used for increasing the network depth and extracting multi-scale information. The specific steps of the invention for people counting are as follows: inputting a picture, firstly extracting the feature information of the picture through the front end VGG16_ bn of the encoder, and then sequentially inputting the extracted feature map into five residual error networks with cavity pyramid convolution to extract multi-scale information. The network structure is firstly passed through a 1 x 1 convolution kernel, and is subjected to batch normalization and input to an activation function. And then dividing the extracted characteristic graphs into a plurality of groups, extracting multi-scale information by using a plurality of groups of pyramid convolutions with different voidage in a multi-scale void pyramid convolution aggregation module respectively, aggregating the characteristic information, performing batch normalization, and inputting the aggregated characteristic information into an activation function. The first three modules divide the input feature map into four groups, the void ratio D is 1, 2, 3, 4, the fourth module divides the input feature map into two groups, the void ratio D is 1, 2, and the fifth module only adopts a group of 3 × 3 convolution kernels with the void ratio of 1. After the module outputs the characteristic information, the input and the output are added through residual connection to obtain the output mapping of the residual network. And finally, inputting the extracted multi-scale information into a decoder for up-sampling, and outputting an estimated density map. The method comprises the following specific steps of: the pictures are first input to the encoder of the network, whose network structure is the same as that of the people counting. 
The decoder then first adjusts the output size to 1/4 of the input image through a transposed convolution, then outputs a head pixel map and a background map through a 1 × 1 convolution kernel, and finally outputs a localization map of the same size as the input image through two bilinear interpolation adjustments.
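As an illustration, the localization decoder described above can be sketched in PyTorch. The input width (128 channels) and the intermediate width (64) are assumptions made for this sketch; the description fixes only the scale changes (transposed convolution to 1/4 of the input, a 1 × 1 convolution producing two maps, then two bilinear ×2 interpolations back to full size):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalizationDecoder(nn.Module):
    """Sketch of the crowd localization decoder (channel widths assumed)."""

    def __init__(self, in_channels: int = 128):
        super().__init__()
        # encoder features at 1/8 scale -> 1/4 of the input image
        self.up = nn.ConvTranspose2d(in_channels, 64, kernel_size=2, stride=2)
        # 1 x 1 kernel: one head-pixel map + one background map
        self.head = nn.Conv2d(64, 2, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.head(self.up(x))
        for _ in range(2):  # two bilinear x2 steps: 1/4 -> 1/2 -> full size
            x = F.interpolate(x, scale_factor=2, mode="bilinear",
                              align_corners=False)
        return x
```

A feature map at 1/8 of the input resolution thus comes out at full resolution with two channels, matching the head-pixel map and background map described above.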
The method provided by the invention effectively extracts multi-scale information through several groups of pyramid convolution kernels with different dilation rates, addressing the problem of non-uniform head sizes. Adding batch normalization to the output of each layer mitigates the training difficulty caused by increased network depth, while the residual structure further deepens the network without increasing the number of parameters, yielding stronger robustness.
The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A training method for a crowd counting network, comprising a step one of performing crowd counting, which comprises the following steps:
step S1-1, the front end of the encoder of the network adopts the first ten layers of VGG16_bn; a sample picture is input to the front end of the encoder, and the feature information of the picture is extracted;
step S1-2, the feature information of the picture extracted by the front end of the encoder is sent to the back end of the encoder, which adopts five residual network structures with multi-scale dilated pyramid convolution aggregation modules to increase the depth of the network and extract multi-scale feature information;
step S1-3, the extracted multi-scale feature information is sent to a crowd counting decoder for three upsampling operations, and finally a one-channel estimated density map is output;
and step S1-4, a first loss function is calculated from the one-channel estimated density map, the network is optimized according to the first loss function, and the evaluation index of the crowd count is calculated.
2. The training method for a crowd counting network according to claim 1, further comprising, before step S1-3, generating the crowd counting decoder, wherein:
the crowd counting decoder first outputs a density map through a 1 × 1 convolution kernel, then performs three transposed convolutions, and outputs an estimated density map of the same size as the input picture.
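A minimal PyTorch sketch of this counting decoder, assuming the encoder output has 128 channels at 1/8 of the input resolution (widths and kernel sizes are not fixed by the claim and are chosen here for illustration):

```python
import torch
import torch.nn as nn

class CountingDecoder(nn.Module):
    """Sketch of the counting decoder of claim 2 (parameters assumed):
    a 1 x 1 convolution reduces the features to a one-channel density
    map, then three stride-2 transposed convolutions upsample it back
    to the input resolution (1/8 -> 1/4 -> 1/2 -> 1/1)."""

    def __init__(self, in_channels: int = 128):
        super().__init__()
        self.to_density = nn.Conv2d(in_channels, 1, kernel_size=1)
        self.upsample = nn.Sequential(
            nn.ConvTranspose2d(1, 1, kernel_size=2, stride=2),
            nn.ConvTranspose2d(1, 1, kernel_size=2, stride=2),
            nn.ConvTranspose2d(1, 1, kernel_size=2, stride=2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.upsample(self.to_density(x))
```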
3. The training method for a crowd counting network according to claim 1, wherein the first loss function in step S1-4 is the L2 loss, expressed as:
L2 = (1/(2N)) Σ_{i=1}^{N} ||M̂_i − M_i||₂²,
where N refers to the number of pictures used for batch training, M_i is the ground-truth density map, and M̂_i is the estimated density map of the network.
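The L2 loss of claim 3 can be written directly, for example with NumPy (array layout `(N, H, W)` is an assumption of this sketch):

```python
import numpy as np

def counting_loss(est_density: np.ndarray, true_density: np.ndarray) -> float:
    """L2 loss of claim 3: L2 = (1/(2N)) * sum_i ||M_hat_i - M_i||^2,
    where N is the batch size; arrays have shape (N, H, W)."""
    n = est_density.shape[0]
    diff = est_density - true_density
    return float(np.sum(diff ** 2) / (2 * n))
```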
4. The training method for a crowd counting network according to claim 1, further comprising a step two of performing crowd localization, which comprises the following steps:
step S2-1, the front end of the encoder of the network adopts the first ten layers of VGG16_bn; a sample picture is input to the front end, and the feature information of the picture is extracted;
step S2-2, the extracted feature information of the picture is sent to the back end of the encoder, which adopts five residual network structures with multi-scale dilated pyramid convolution aggregation modules to increase the depth of the network and extract multi-scale feature information;
step S2-3, the extracted multi-scale feature information is sent to a crowd localization decoder, which outputs a head pixel map and a background map with two channels;
and step S2-4, a second loss function is calculated from the head pixel map and the background map, the network is optimized according to the second loss function, and the evaluation index of crowd localization is calculated.
5. The training method for a crowd counting network according to claim 4, wherein the second loss function in step S2-4 is a cross-entropy loss, expressed as:
L = −(1/N) Σ_{j=1}^{N} Σ_{p=1}^{m×n} [γ·Ψ(X_p)·log Y(X_p) + (1 − Ψ(X_p))·log(1 − Y(X_p))],
where j refers to the j-th picture of the batch input, N refers to the number of pictures in the batch, p refers to the p-th pixel of each picture, m × n is the pixel size of each picture, γ is used to increase the weight at the head points, Y(X_p) is the prediction label generated for the p-th pixel of the j-th picture by the crowd localization network, taking values 0 and 1, and Ψ(X_p) refers to the ground-truth map of the data set.
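Under the usual weighted binary cross-entropy reading of this claim (the claim's equation image is not reproduced in the text, so the exact form below, and the default γ, are assumptions of this sketch), the loss can be written as:

```python
import numpy as np

def localization_loss(pred: np.ndarray, truth: np.ndarray,
                      gamma: float = 2.0, eps: float = 1e-7) -> float:
    """Weighted pixel-wise cross-entropy in the spirit of claim 5
    (form and default gamma are assumptions).  `pred` holds Y(X_p),
    the per-pixel head prediction of the localization network; `truth`
    holds the ground-truth map Psi(X_p) in {0, 1}; gamma raises the
    weight of the head pixels.  Shapes: (N, m, n)."""
    n = pred.shape[0]
    p = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    ce = -(gamma * truth * np.log(p) + (1.0 - truth) * np.log(1.0 - p))
    return float(ce.sum() / n)
```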
6. The training method for a crowd counting network according to claim 4, further comprising, before step S2-3, generating the crowd localization decoder, wherein:
the crowd localization decoder first adjusts the output size to 1/4 of the input image through a transposed convolution, then outputs a head pixel map and a background map through a 1 × 1 convolution kernel, and finally outputs a localization map of the same size as the input image through two bilinear interpolation adjustments.
7. The training method for a crowd counting network according to claim 1 or 4, wherein, in step S1-2 or S2-2, sending the extracted feature information of the picture to the back end of the encoder, which adopts five residual network structures with multi-scale dilated pyramid convolution aggregation modules to increase the depth of the network and extract multi-scale feature information, comprises:
firstly, the feature information of the picture extracted in the previous step is input to the multi-scale dilated pyramid convolution aggregation module, which first groups the feature maps; each group extracts multi-scale information with 3 × 3 convolution kernels of a different dilation rate, the number of channels in each group being 128; dilated convolution enlarges the receptive field without increasing the number of parameters, and a convolution kernel with a given dilation rate is equivalent to a new, larger convolution kernel, whose size is expressed as:
K = k + (k − 1) × (d − 1),
where K is the equivalent convolution kernel size, k is the dilated convolution kernel size, and d is the dilation rate of the kernel; by this calculation, a 3 × 3 convolution with d = 2 corresponds to a 5 × 5 convolution kernel, a 3 × 3 convolution kernel with d = 3 to a 7 × 7 convolution kernel, and a 3 × 3 convolution kernel with d = 4 to a 9 × 9 convolution kernel;
then the multi-scale context information extracted by the different groups of convolution kernels is aggregated, processed through batch normalization, and finally sent to a ReLU activation function, expressed as:
F_i = R_relu(F_bn(C(P_i^1, …, P_i^n))),
where P_i^n represents the output feature map of the n-th group of convolution kernels of the i-th multi-scale dilated pyramid convolution aggregation module, C(·) represents the aggregation of the extracted multi-scale information, F_bn denotes batch normalization, and R_relu denotes sending the feature information to the ReLU activation function.
8. The training method for a crowd counting network according to claim 1 or 4, wherein, among the five multi-scale dilated pyramid convolution aggregation modules of step S1-2 or S2-2,
the first three modules divide the feature information of the input picture into four groups and convolve them with 3 × 3 convolution kernels of dilation rates D = 1, 2, 3, 4 respectively, with 512 output channels;
the fourth module divides the input feature map into two groups with dilation rates D = 1, 2, with 256 output channels;
and the fifth module adopts only a single group of 3 × 3 convolution kernels with dilation rate 1, with 128 output channels.
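The grouped dilated convolutions of claims 7 and 8 can be sketched as a PyTorch module. The four-group, 128-channels-per-group configuration of the first three modules is shown; setting the padding equal to the dilation rate (an implementation choice not stated in the claims) keeps the spatial size unchanged:

```python
import torch
import torch.nn as nn

class DilatedPyramidAggregation(nn.Module):
    """Sketch of a multi-scale dilated pyramid convolution aggregation
    module: the input feature map is split into groups, each group is
    filtered by a 3 x 3 convolution with its own dilation rate, and the
    group outputs are concatenated, batch-normalized and passed through
    ReLU (128 channels per group, as in the description)."""

    def __init__(self, dilations=(1, 2, 3, 4), group_channels: int = 128):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(group_channels, group_channels, kernel_size=3,
                      padding=d, dilation=d)
            for d in dilations
        )
        self.bn = nn.BatchNorm2d(group_channels * len(dilations))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        groups = torch.chunk(x, len(self.branches), dim=1)
        out = torch.cat([b(g) for b, g in zip(self.branches, groups)], dim=1)
        return self.relu(self.bn(out))
```

Instantiating with `dilations=(1, 2)` and `dilations=(1,)` would give the fourth and fifth modules' two-group and one-group variants, respectively.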
9. The training method for a crowd counting network according to claim 1 or 4, further comprising, before step S1-2 or S2-2, generating the residual network structure with the multi-scale dilated pyramid convolution aggregation module, comprising:
firstly, to increase the network depth, adding a 1 × 1 convolution layer at each end of the dilated pyramid convolution module, each followed by batch normalization and a ReLU activation function; naming this structure the stacked layer, whose output mapping is expressed as:
F(X) = δ2(Φ(δ1(X))),
where X represents the output of the previous layer, serving as the input of the residual network, δi(·) represents the output mapping of its input through a 1 × 1 convolution kernel, batch normalization and the ReLU activation function, and Φ(δ1(X)) represents taking δ1(X) as input and obtaining an output mapping through the multi-scale dilated pyramid convolution aggregation module;
and then selecting the structure of the residual network, in a switch-like manner, according to the relation between the numbers of output and input channels of the stacked layer.
10. The training method for a crowd counting network according to claim 9, wherein selecting the structure of the residual network in a switch-like manner according to the relation between the numbers of output and input channels of the stacked layer comprises:
for the first three multi-scale dilated pyramid convolution aggregation modules, the number of input channels equals the number of output channels, and the network directly adds the output of the stacked layer to its input to obtain the output, defined as:
Y = F(X) + X,
where Y represents the output of the residual network;
for the fourth and fifth multi-scale dilated pyramid convolution aggregation modules, the number of input channels does not equal the number of output channels, so the network first sends the input through a 1 × 1 convolution kernel to change the number of channels and then adds it to the output of the stacked layer, defined as:
Y = F(X) + δ3(X),
where Y represents the output of the residual network and δ3(X) represents the output mapping of the input feature map through a 1 × 1 convolution kernel, batch normalization and the ReLU activation function.
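The switch-like choice of shortcut in claims 9 and 10 can be sketched as follows. For brevity, the stacked layer F(X) is stood in for by a single 3 × 3 convolution rather than the full 1 × 1 / pyramid / 1 × 1 stack, so this is an illustration of the shortcut selection only:

```python
import torch
import torch.nn as nn

class ResidualStack(nn.Module):
    """Sketch of claim 10's channel-count switch: when the stacked
    layer preserves the channel count, the plain input is added
    (Y = F(X) + X); when it changes the channel count, a 1 x 1
    convolution (delta_3) adapts the shortcut first."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # stand-in for the 1x1 / pyramid / 1x1 stacked layer F(X)
        self.stack = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        if in_channels == out_channels:
            self.shortcut = nn.Identity()                      # Y = F(X) + X
        else:
            self.shortcut = nn.Conv2d(in_channels, out_channels, 1)  # delta_3

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.stack(x) + self.shortcut(x)
```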
CN202111449140.3A 2021-11-29 2021-11-29 Training method of crowd counting network Active CN114154620B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111449140.3A CN114154620B (en) 2021-11-29 2021-11-29 Training method of crowd counting network

Publications (2)

Publication Number Publication Date
CN114154620A true CN114154620A (en) 2022-03-08
CN114154620B CN114154620B (en) 2024-05-21

Family

ID=80455410

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111449140.3A Active CN114154620B (en) 2021-11-29 2021-11-29 Training method of crowd counting network

Country Status (1)

Country Link
CN (1) CN114154620B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200118423A1 (en) * 2017-04-05 2020-04-16 Carnegie Mellon University Deep Learning Methods For Estimating Density and/or Flow of Objects, and Related Methods and Software
CN111242036A (en) * 2020-01-14 2020-06-05 西安建筑科技大学 Crowd counting method based on encoding-decoding structure multi-scale convolutional neural network
WO2020156028A1 (en) * 2019-01-28 2020-08-06 南京航空航天大学 Outdoor non-fixed scene weather identification method based on deep learning
CN111611878A (en) * 2020-04-30 2020-09-01 杭州电子科技大学 Method for crowd counting and future people flow prediction based on video image

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
严芳芳; 吴秦: "Crowd counting algorithm using multi-channel fusion grouped convolutional neural networks", 小型微型计算机系统 (Journal of Chinese Computer Systems), no. 10
孟月波; 纪拓; 刘光辉; 徐胜军; 李彤月: "Encoder-decoder multi-scale convolutional neural network method for crowd counting", 西安交通大学学报 (Journal of Xi'an Jiaotong University), no. 05

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115775227A (en) * 2022-10-12 2023-03-10 浙江吉昌新材料有限公司 Intelligent production method of anti-cracking sagger and control system thereof
CN115763167A (en) * 2022-11-22 2023-03-07 黄华集团有限公司 Solid cabinet breaker and control method thereof
CN115763167B (en) * 2022-11-22 2023-09-22 黄华集团有限公司 Solid cabinet circuit breaker and control method thereof


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant