CN112966600A - Adaptive multi-scale context aggregation method for crowded crowd counting - Google Patents
Adaptive multi-scale context aggregation method for crowded crowd counting
- Publication number
- CN112966600A CN112966600A CN202110242403.7A CN202110242403A CN112966600A CN 112966600 A CN112966600 A CN 112966600A CN 202110242403 A CN202110242403 A CN 202110242403A CN 112966600 A CN112966600 A CN 112966600A
- Authority
- CN
- China
- Prior art keywords
- scale
- context
- resolution
- scale context
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
- G06V20/53—Recognition of crowd images, e.g. recognition of crowd congestion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/211—Selection of the most significant subset of features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention provides an adaptive multi-scale context aggregation method for crowd counting, comprising the following steps: inputting a sample picture into a backbone network and extracting a feature map whose size is j times the resolution of the input image; feeding the extracted feature map into a plurality of cascaded multi-scale context aggregation modules, which extract and adaptively aggregate multi-scale context information to obtain multi-scale context features; processing the generated multi-scale context features with a convolution layer to generate a density map; and integrating and summing the density map to obtain the predicted number of people. The invention effectively extracts multi-scale information, solving the problem of non-uniform head sizes; it adaptively selects and aggregates useful context information through a channel attention mechanism, avoiding information redundancy; and it achieves more accurate density estimation and higher robustness in crowded scenes.
Description
Technical Field
The invention relates to the technical field of data processing, in particular to an adaptive multi-scale context aggregation method for crowd counting.
Background
Crowd counting is a basic task of computer-vision-based crowd analysis, which aims to automatically estimate how crowded a scene is.
In crowd scenes, however, the task faces challenging factors such as severe occlusion, scale change, and diverse crowd distributions. In very crowded scenes in particular, estimating crowd density becomes difficult because the foreground crowd is visually similar to background objects and head scale varies widely.
Existing networks directly aggregate context features of different scales, but not all of these features are useful for the final count; the information redundancy caused by direct aggregation degrades counting performance.
Disclosure of Invention
In view of the deficiencies in the prior art, it is an object of the present invention to provide an adaptive multi-scale context aggregation method for crowd counting.
The invention provides an adaptive multi-scale context aggregation method for crowd counting, which comprises the following steps:
step 1: inputting the sample picture into a backbone network, and extracting a feature map whose size is j times the resolution of the input image;
step 2: inputting the extracted feature map into a plurality of multi-scale context aggregation modules in a cascading manner, extracting and adaptively aggregating multi-scale context information to obtain multi-scale context features; an up-sampling layer is arranged behind each multi-scale context aggregation module to convert the multi-scale context features into a feature map of higher resolution;
step 3: performing convolution layer processing on the generated multi-scale context features to generate a density map;
step 4: calculating a loss function between the generated density map and the ground-truth density map, and optimizing the network parameters;
step 5: integrating and summing the generated density map to obtain the predicted number of people.
Optionally, the step 4 includes:
generating a ground-truth density map of the crowd through Gaussian kernel convolution according to the picture with head mark points, wherein the density map is calculated as:

F(x) = Σ_{i=1}^{N} δ(x − x_i) * G_σ(x)

wherein F(x) represents the ground-truth density map, x_i the pixel point of the i-th head, G_σ a Gaussian kernel, δ(·) the Dirac delta function, σ the standard deviation, N the total number of people in the picture, and x a pixel point of the picture.
Optionally, the step 2 includes:
the multi-scale context aggregation module adaptively selects small-scale context features and aggregates them with large-scale context features; the multi-scale context aggregation module comprises a plurality of dilated-convolution branches with different dilation rates;

let X_i^j ∈ R^(jW×jH×C) denote the feature map extracted by the dilated convolution of the i-th scale, where i represents the dilation rate of the convolution kernel, j indicates that the resolution is j times that of the input image, r represents the reduction rate of the backbone network, W×H represents the resolution of the image, C represents the number of channels, and R^(jW×jH×C) represents the set of all feature maps of j-times resolution;

inputting the feature maps extracted by the dilated convolutions into a channel attention module, wherein the channel attention module adaptively selects useful context feature information with a selection function f and outputs a feature map Y_j ∈ R^(jW×jH×C) aggregating the context information, defined as:

Y_j = f( … f( f(X_1^j) ⊕ X_2^j ) ⊕ X_3^j … ) ⊕ X_n^j

wherein Y_j represents the j-times-resolution feature map extracted by the aggregation module, ⊕ denotes element-by-element summation, and X_1^j, X_2^j, X_3^j, …, X_n^j represent the extracted feature maps of the 1st, 2nd, 3rd, …, n-th scales, each with resolution j times that of the input picture.
Optionally, the adaptively selecting, with a selection function f, the useful context feature information comprises:

performing pooling on each context feature through a global spatial average pooling layer, and outputting the feature information F_avg(X_i^j);

processing the feature information F_avg(X_i^j) with a bottleneck structure consisting of two fully connected layers, and normalizing the output features to (0, 1) through a sigmoid function, the adaptive output coefficient being calculated as:

a_i = Sigmoid( W_2 · ReLU( W_1 · F_avg(X_i^j) ) )

wherein W_1 and W_2 respectively represent the weight matrices of the two fully connected layers, the first fully connected layer being followed by a ReLU function and the second fully connected layer by a sigmoid function, and F_avg(X_i^j) represents the output after global average pooling;

adding a residual connection between the input and the output of the channel attention mechanism yields the selection function, defined as:

f(X_i^j) = X_i^j ⊕ ( a_i ⊗ X_i^j )

wherein f(X_i^j) represents the output of the i-th channel attention module, X_i^j the feature map extracted by the dilated convolution of the i-th scale, a_i the adaptive coefficient of the i-th channel attention module, and ⊗ channel-wise multiplication.
Compared with the prior art, the invention has the following beneficial effects:
the self-adaptive multi-scale context aggregation method for counting crowds effectively extracts multi-scale information, solves the problem of nonuniform head sizes, avoids information redundancy through self-adaptive selection and aggregation of useful context information through a channel attention mechanism, can realize more accurate density estimation in crowded scenes, and has higher robustness.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
fig. 1 is a schematic diagram of an adaptive multi-scale context aggregation method for crowd counting according to an embodiment of the present invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit it in any way. It should be noted that various changes and modifications, obvious to those skilled in the art, can be made without departing from the spirit of the invention; all such changes and modifications fall within the scope of the present invention.
The invention provides an adaptive multi-scale context aggregation method for crowd counting, used for crowd density estimation in crowded scenes. The method mainly comprises the following steps: a picture is input, feature information is first extracted through a backbone network, and the extracted feature map is then fed into a plurality of cascaded multi-scale context aggregation modules. Each module first extracts multi-scale information using convolution kernels with different dilation rates, then adaptively selects context feature information per channel through a channel attention mechanism and aggregates it. After each multi-scale context aggregation module, the feature map is converted into one of higher resolution by up-sampling; finally an estimated density map is output through a 1×1 convolution kernel, and the predicted number of people is obtained by integral summation. The method effectively extracts multi-scale information through several convolution kernels with different dilation rates, solving the problem of non-uniform head sizes; by adaptively selecting and aggregating useful context information through a channel attention mechanism it avoids information redundancy, achieves more accurate density estimation in crowded scenes, and has higher robustness.
Fig. 1 is a schematic diagram of a principle of an adaptive multi-scale context aggregation method for crowd counting according to an embodiment of the present invention, as shown in fig. 1, the method may include the following steps:
step S1: inputting the sample picture into a backbone network, and extracting a feature map whose size is j times the resolution of the input image.
Step S2: and inputting the extracted feature maps into a plurality of self-adaptive multi-scale context aggregation modules in a cascading mode, extracting and self-adaptively aggregating multi-scale context information, wherein an up-sampling layer is arranged behind each module and used for converting the multi-scale context features into feature maps with higher resolution.
Step S3: and performing 1 × 1 convolutional layer processing on the generated multi-scale context features to generate a density map.
Step S4: calculating a loss function between the generated density map and the true density map, and optimizing network parameters;
step S5: and integrating and summing the density maps to obtain the predicted number of people.
In this embodiment, a ground-truth density map of the crowd is generated through Gaussian kernel convolution according to the picture with head mark points. Denoting the pixel point of a head as x_i and the Gaussian kernel as G_σ, the ground-truth density map can be expressed as:

F(x) = Σ_{i=1}^{N} δ(x − x_i) * G_σ(x)

wherein F(x) represents the ground-truth density map, x_i the pixel point of the i-th head, G_σ the Gaussian kernel, δ(·) the Dirac delta function, σ the standard deviation, N the total number of people in the picture, and x a pixel point of the picture.
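As a concrete illustration, this ground-truth construction can be sketched in a few lines of NumPy/SciPy. The `sigma` value, image size, and head coordinates below are arbitrary illustrative choices, not values fixed by the patent:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def ground_truth_density(shape, head_points, sigma=4.0):
    """Place a Dirac delta at each annotated head pixel, then convolve
    with a Gaussian kernel G_sigma; the result integrates to the count N."""
    density = np.zeros(shape, dtype=np.float64)
    for row, col in head_points:
        density[row, col] += 1.0
    # scipy's default 'reflect' boundary mode preserves the total mass
    return gaussian_filter(density, sigma)

heads = [(20, 30), (40, 50), (60, 10)]   # toy head annotations
dmap = ground_truth_density((100, 100), heads)
print(round(dmap.sum()))  # → 3, the number of annotated heads
```

Summing the map recovers the head count, which is exactly the integral-summation step used at prediction time.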
Specifically, the adaptive multi-scale context aggregation module of step S2 is shown in fig. 1; it adaptively selects reliable small-scale context features and aggregates them with large-scale context features. The specific operation is as follows:

the multi-scale context aggregation module comprises a plurality of dilated-convolution branches with different dilation rates; let X_i^j ∈ R^(jW×jH×C) denote the feature map extracted by the dilated convolution of the i-th scale, where i represents the dilation rate of the convolution kernel, j indicates that the resolution is j times that of the input image, r represents the reduction rate of the backbone network, W×H represents the resolution of the image, C the number of channels, and R^(jW×jH×C) the set of all feature maps of j-times resolution. The extracted feature maps are then input into a channel attention module (CA), which adaptively selects useful context feature information with a selection function f and finally outputs a feature map Y_j ∈ R^(jW×jH×C) aggregating the context information, defined as:

Y_j = f( … f( f(X_1^j) ⊕ X_2^j ) ⊕ X_3^j … ) ⊕ X_n^j

wherein Y_j represents the j-times-resolution feature map extracted by the aggregation module, ⊕ denotes element-by-element summation, and X_1^j, …, X_n^j represent the extracted feature maps of the 1st to n-th scales.
Illustratively, the selection function f adopts a channel attention mechanism to aggregate the multi-scale context information, operating as follows:

each feature first passes through a global spatial average pooling layer (denoted F_avg); the pooled features are then processed by a bottleneck structure consisting of two fully connected layers, and the output features are finally normalized to (0, 1) through a sigmoid function. The adaptive output coefficient can be expressed as:

a_i = Sigmoid( W_2 · ReLU( W_1 · F_avg(X_i^j) ) )

wherein W_1 and W_2 respectively represent the weight matrices of the two fully connected layers, the first being followed by a ReLU function and the second by a sigmoid function, and F_avg(X_i^j) represents the output after global average pooling.

Furthermore, for better optimization, a residual connection is added between the input and the output of the channel attention mechanism, and the final selection function is defined as:

f(X_i^j) = X_i^j ⊕ ( a_i ⊗ X_i^j )
Compared with existing counting methods, the method extracts multi-scale information with convolutions of different dilation rates and adaptively selects and aggregates the multi-scale context information through a channel attention mechanism; it performs well in crowded scenes and improves crowd counting accuracy.
The technical solution of the present invention will now be described in more detail with reference to a specific example. When the pixel values and labels of a picture are known, the ground-truth density map corresponding to the picture is obtained through Gaussian convolution and can be represented as:

F(x) = Σ_{i=1}^{N} δ(x − x_i) * G_σ(x)

wherein x_i represents a pixel point with a head, x a pixel point of the picture, G_σ the Gaussian kernel, δ(·) the Dirac delta function, σ the standard deviation, and N the total number of people in the picture.
Then, learning a complex nonlinear mapping from the input image to the crowd estimation density map through a multi-scale context aggregation network, wherein the specific details are as follows:
the front ten layers of VGG-16 are selected as a backbone network, pictures are input into the backbone network, feature information is extracted, and the size of a feature map is 1\8 of the size of an input image.
The extracted feature map is convolved with a 3×3 convolution kernel, and the feature information is then sent to the multi-scale context aggregation module. Features of different scales are first extracted through a plurality of dilated-convolution branches with different dilation rates; the feature of the i-th scale is recorded as X_i^j, giving n scales in total.
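The dilated-convolution branches can be illustrated with a minimal single-channel NumPy sketch. The real network uses learned multi-channel kernels; the averaging kernel, feature size, and dilation rates here are illustrative assumptions only:

```python
import numpy as np

def dilated_conv2d(x, kernel, rate):
    """Single-channel 'same'-padded 2-D cross-correlation with dilation `rate`.

    A k x k kernel with dilation r covers a (k-1)*r + 1 receptive field,
    which is how the module sees multiple scales with the same number of
    parameters per branch."""
    k = kernel.shape[0]
    pad = (k - 1) * rate // 2
    xp = np.pad(x, pad)
    out = np.zeros(x.shape, dtype=np.float64)
    for u in range(k):
        for v in range(k):
            out += kernel[u, v] * xp[u * rate : u * rate + x.shape[0],
                                     v * rate : v * rate + x.shape[1]]
    return out

feat = np.random.rand(32, 32)
kernel = np.full((3, 3), 1.0 / 9.0)  # toy 3x3 averaging kernel
branches = [dilated_conv2d(feat, kernel, r) for r in (1, 2, 3)]  # one branch per rate
print([b.shape for b in branches])  # [(32, 32), (32, 32), (32, 32)]
```

All branches keep the input resolution, so they can later be summed element by element.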
The features X_i^j are sent to the attention module, which adaptively aggregates the multi-scale context information. The context information is first extracted through a global spatial average pooling layer; the features are then processed by a bottleneck structure formed by two fully connected layers, and the output is finally normalized to (0, 1) through a sigmoid function. The adaptive output coefficient can be expressed as:

a_i = Sigmoid( W_2 · ReLU( W_1 · F_avg(X_i^j) ) )

Finally, a residual connection is applied directly between the input and the output of the channel attention mechanism, and the final output result is:

f(X_i^j) = X_i^j ⊕ ( a_i ⊗ X_i^j )
will be provided withMulti-scale contextual feature information selected by attention mechanismAnd 2 nd scale informationThe pixel-by-pixel summation is performed, which can be expressed as:
extract the obtainedThe feature information is sent to a channel attention mechanism to self-adaptively select context information, pixel summation is carried out on the feature information and the feature information of the 3 rd scale, and the like, and finally the feature mapping which aggregates the multi-scale context information is obtained:
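This cascade can be sketched in a few lines; the `select` gate below is a toy scalar stand-in for the channel-attention selection function f, and the shapes are illustrative:

```python
import numpy as np

def aggregate_multiscale(scale_feats, select):
    """Cascaded aggregation: adaptively select from the running
    accumulation, then sum pixel-wise with the next (larger) scale."""
    z = scale_feats[0]
    for x_next in scale_feats[1:]:
        z = select(z) + x_next   # Z_k = f(Z_{k-1}) + X_k
    return z

scale_feats = [np.full((4, 4), float(v)) for v in (1, 2, 3)]
agg = aggregate_multiscale(scale_feats, select=lambda z: 0.5 * z)  # toy gate
print(agg[0, 0])  # 0.5*(0.5*1 + 2) + 3 = 4.25
```

The toy gate halves its input; the real module would apply the channel attention of the previous sketch instead.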
after multi-scale context information is extracted by the multi-scale context aggregation module, the multi-scale context information is converted into a feature map with higher resolution by up-sampling. And then sending the data to a multi-scale context aggregation module for feature extraction in the same mode, passing through three multi-scale context aggregation modules all the time, finally outputting an estimated density map through a 1 x 1 convolution kernel, and calculating a loss function L (theta):
wherein F (I)i(ii) a θ) is a density map of the output of the network, FiThe method is characterized in that the method is a real density graph, theta is a parameter required to be optimized by a network, and the network continuously optimizes the parameter theta through a gradient descent method to find a parameter value which enables a loss function to be minimum.
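The training objective is a plain Euclidean loss between predicted and ground-truth density maps and can be sketched directly; the batch of maps below consists of toy values:

```python
import numpy as np

def density_loss(pred_maps, true_maps):
    """Euclidean loss L(theta) = 1/(2M) * sum_i ||F(I_i; theta) - F_i||^2,
    where M is the number of training images."""
    m = len(pred_maps)
    return sum(np.sum((p - t) ** 2) for p, t in zip(pred_maps, true_maps)) / (2.0 * m)

pred = [np.ones((4, 4)), np.zeros((4, 4))]   # toy predicted density maps
gt = [np.zeros((4, 4)), np.zeros((4, 4))]    # toy ground-truth maps
print(density_loss(pred, gt))  # 16 / (2*2) = 4.0
```

In training, this scalar would be minimized over θ by gradient descent, as the description states.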
It should be noted that, the steps in the adaptive multi-scale context aggregation method for counting crowds provided by the present invention may be implemented by using corresponding modules, devices, units, and the like in the adaptive multi-scale context aggregation system for counting crowds, and those skilled in the art may refer to the technical scheme of the system to implement the step flow of the method, that is, the embodiment in the system may be understood as a preferred embodiment for implementing the method, and details are not repeated here.
Those skilled in the art will appreciate that, in addition to implementing the system and its various devices provided by the present invention in purely computer readable program code means, the method steps can be fully programmed to implement the same functions by implementing the system and its various devices in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system and various devices thereof provided by the present invention can be regarded as a hardware component, and the devices included in the system and various devices thereof for realizing various functions can also be regarded as structures in the hardware component; means for performing the functions may also be regarded as structures within both software modules and hardware components for performing the methods.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.
Claims (4)
1. An adaptive multi-scale context aggregation method for crowd counting, comprising:
step 1: inputting the sample picture into a backbone network, and extracting a characteristic diagram with the size being j times of the resolution of the input image;
step 2: inputting the extracted feature graph into a plurality of multi-scale context aggregation modules in a cascading mode, extracting and adaptively aggregating multi-scale context information to obtain multi-scale context features; an up-sampling layer is arranged behind each multi-scale context aggregation module and used for converting the multi-scale context features into a feature map with higher resolution;
and step 3: performing convolution layer processing on the generated multi-scale context characteristics to generate a density graph;
and 4, step 4: calculating a loss function between the generated density map and the true density map, and optimizing network parameters;
and 5: and performing integral summation on the generated density map to obtain the predicted number of people.
2. The adaptive multi-scale context aggregation method for crowd counting according to claim 1, wherein the step 4 comprises:
generating a ground-truth density map of the crowd through Gaussian kernel convolution according to the picture with head mark points, wherein the density map is calculated as:

F(x) = Σ_{i=1}^{N} δ(x − x_i) * G_σ(x)

wherein F(x) represents the ground-truth density map, x_i the pixel point of the i-th head, G_σ a Gaussian kernel, δ(·) the Dirac delta function, σ the standard deviation, N the total number of people in the picture, and x a pixel point of the picture.
3. The adaptive multi-scale context aggregation method for crowd counting according to claim 1, wherein the step 2 comprises:
the multi-scale context aggregation module adaptively selects small-scale context features and aggregates them with large-scale context features; the multi-scale context aggregation module comprises a plurality of dilated-convolution branches with different dilation rates;

let X_i^j ∈ R^(jW×jH×C) denote the feature map extracted by the dilated convolution of the i-th scale, where i represents the dilation rate of the convolution kernel, j indicates that the resolution is j times that of the input image, r represents the reduction rate of the backbone network, W×H represents the resolution of the image, C represents the number of channels, and R^(jW×jH×C) represents the set of all feature maps of j-times resolution;

inputting the feature maps extracted by the dilated convolutions into a channel attention module, wherein the channel attention module adaptively selects useful context feature information with a selection function f and outputs a feature map Y_j ∈ R^(jW×jH×C) aggregating the context information, defined as:

Y_j = f( … f( f(X_1^j) ⊕ X_2^j ) ⊕ X_3^j … ) ⊕ X_n^j

wherein Y_j represents the j-times-resolution feature map extracted by the aggregation module, ⊕ denotes element-by-element summation, and X_1^j, X_2^j, X_3^j, …, X_n^j represent the extracted feature maps of the 1st, 2nd, 3rd, …, n-th scales, each with resolution j times that of the input picture.
4. The adaptive multi-scale context aggregation method for crowd counting according to claim 3, wherein the adaptively selecting, with a selection function f, the useful context feature information comprises:
performing pooling on each context feature through a global spatial average pooling layer, and outputting the feature information F_avg(X_i^j);
processing F_avg(X_i^j) with a bottleneck structure consisting of two fully connected layers, and normalizing the output features to (0, 1) through a sigmoid function, the adaptive output coefficient being calculated as:

a_i = Sigmoid( W_2 · ReLU( W_1 · F_avg(X_i^j) ) )

wherein W_1 and W_2 respectively represent the weight matrices of the two fully connected layers, the first fully connected layer being followed by a ReLU function and the second fully connected layer by a sigmoid function, and F_avg(X_i^j) represents the output after global average pooling;
adding a residual connection between the input and the output of the channel attention mechanism yields the selection function, defined as:

f(X_i^j) = X_i^j ⊕ ( a_i ⊗ X_i^j )
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110242403.7A CN112966600B (en) | 2021-03-04 | 2021-03-04 | Self-adaptive multi-scale context aggregation method for crowded population counting |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112966600A true CN112966600A (en) | 2021-06-15 |
CN112966600B CN112966600B (en) | 2024-04-16 |
Family
ID=76277443
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110242403.7A Active CN112966600B (en) | 2021-03-04 | 2021-03-04 | Self-adaptive multi-scale context aggregation method for crowded population counting |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112966600B (en) |
- 2021-03-04: CN202110242403.7A (CN) — patent CN112966600B, status: Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020169043A1 (en) * | 2019-02-21 | 2020-08-27 | 苏州大学 | Dense crowd counting method, apparatus and device, and storage medium |
CN110263849A (en) * | 2019-06-19 | 2019-09-20 | 合肥工业大学 | A kind of crowd density estimation method based on multiple dimensioned attention mechanism |
CN111242036A (en) * | 2020-01-14 | 2020-06-05 | 西安建筑科技大学 | Crowd counting method based on encoding-decoding structure multi-scale convolutional neural network |
CN111709290A (en) * | 2020-05-18 | 2020-09-25 | 杭州电子科技大学 | Crowd counting method based on coding and decoding-jumping connection scale pyramid network |
CN112132023A (en) * | 2020-09-22 | 2020-12-25 | 上海应用技术大学 | Crowd counting method based on multi-scale context enhanced network |
Non-Patent Citations (1)
Title |
---|
CHEN Peng; TANG Yiping; WANG Liran; HE Xia: "Crowd density estimation with multi-level feature fusion", Journal of Image and Graphics, no. 08 *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114120233A (en) * | 2021-11-29 | 2022-03-01 | 上海应用技术大学 | Training method of lightweight pyramid hole convolution aggregation network for crowd counting |
CN114120233B (en) * | 2021-11-29 | 2024-04-16 | 上海应用技术大学 | Training method of lightweight pyramid cavity convolution aggregation network for crowd counting |
Also Published As
Publication number | Publication date |
---|---|
CN112966600B (en) | 2024-04-16 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |