CN113192009B - Crowd counting method and system based on global context convolutional network


Info

Publication number: CN113192009B
Authority: CN (China)
Prior art keywords: low-level feature, high-level feature, feature map, global context
Legal status: Active (granted)
Application number: CN202110382645.6A
Other languages: Chinese (zh)
Other versions: CN113192009A
Inventors: 康春萌, 孟琛, 盛星, 吕蕾
Current and original assignee: Shandong Normal University
Application filed by Shandong Normal University; priority to CN202110382645.6A (filed 2021-04-09)
Publication of CN113192009A: 2021-07-30
Application granted; publication of CN113192009B: 2022-09-02

Classifications

    • G06T 7/0002 — Image analysis; inspection of images, e.g. flaw detection
    • G06F 18/253 — Pattern recognition; fusion techniques of extracted features
    • G06N 3/045 — Neural network architectures; combinations of networks
    • G06V 10/40 — Extraction of image or video features
    • G06V 20/53 — Scene-specific elements; recognition of crowd images, e.g. recognition of crowd congestion
    • G06T 2207/20081 — Special algorithmic details; training, learning
    • G06T 2207/20084 — Special algorithmic details; artificial neural networks [ANN]
    • G06T 2207/20221 — Image combination; image fusion, image merging
    • G06T 2207/30242 — Subject of image; counting objects in image


Abstract

The invention provides a crowd counting method and system based on a global context convolutional network. The method extracts a low-level feature map and a high-level feature map from the image to be counted; extracts multi-scale features from each to obtain feature maps carrying multi-scale information; aggregates the global context features to every pixel by capturing spatial and channel information, yielding feature maps that carry context information and capture the long-range dependencies between pixels, so that the feature maps contain richer information; and obtains a crowd density map through upsampling and feature fusion, thereby improving crowd counting accuracy.

Description

Crowd counting method and system based on global context convolutional network
Technical Field
The invention belongs to the field of deep learning and computer vision, and particularly relates to a crowd counting method and system based on a global context convolutional network.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
In recent years, crowd counting has attracted sustained interest in computer vision owing to its wide application in public safety, city planning, traffic control, and the like. The goal of crowd counting is to accurately estimate the number of people in a still image or video frame. Because of factors such as the camera's shooting angle and the varying distances between different people in the crowd and the camera, captured images suffer from scale variation, severe occlusion, and irrelevant background, which greatly degrade the accuracy of crowd counting algorithms.
At present, CNN-based methods have become the mainstream of crowd counting research; their network architectures fall mainly into single-column and multi-column designs. A single-column architecture is typically a single multi-layer convolutional neural network; its structure is simple, but it lacks detail and spatial information. A multi-column architecture generally adopts a multi-scale or multi-column structure to capture richer feature information, but the structure is complex, the computational cost is high, and most such methods do not fully exploit context and scale information. For this reason, some recent crowd counting methods introduce strategies such as dilated (atrous) convolution, pyramid networks, and attention models to improve the existing architectures, yet scale variation and severe occlusion remain major challenges.
Disclosure of Invention
The invention aims to solve the above problems and provides a crowd counting method and system based on a global context convolutional network.
According to some embodiments, the invention adopts the following technical scheme:
a crowd counting method based on a global context convolutional network comprises the following steps:
acquiring a crowd image to be counted;
extracting a low-level feature map and a high-level feature map of the crowd image;
carrying out scale perception on the low-level feature map and the high-level feature map to obtain an enhanced low-level feature map and an enhanced high-level feature map;
sequentially carrying out context modeling and feature transformation on the enhanced low-level feature map and the enhanced high-level feature map, extracting global context features, and obtaining, through feature fusion, the low-level and high-level feature maps fused with global context information;
determining a density map according to the low-level and high-level feature maps fused with global context information;
carrying out crowd counting according to the density map.
As a further limitation, the specific steps of performing scale perception on the low-level feature map and the high-level feature map to obtain an enhanced low-level feature map and an enhanced high-level feature map include:
compressing the channels of the low-level feature map and the high-level feature map through four convolution operations to obtain compressed feature maps;
extracting multi-scale feature maps from the compressed low-level and high-level feature maps through dilated convolutions with different dilation rates;
and splicing the extracted multi-scale feature maps according to a channel splicing method to obtain an enhanced low-level feature map and an enhanced high-level feature map.
As a further limitation, the specific steps of the context modeling are as follows:
convolving the feature map with the linear transformation matrix, and normalizing the attention weights through a softmax function to obtain normalized attention weights;
and applying a reshape operation to the feature map, then matrix-multiplying it with the normalized attention weights to obtain the initial global context feature.
As a further limitation, the specific steps of feature transformation include:
firstly, the initial global context feature is convolved with a linear transformation matrix; LayerNorm and ReLU operations are then performed in sequence; finally, the feature transformation is completed through a 1 × 1 convolution to obtain the global context feature.
By way of further limitation, the feature fusion aggregates the global context features to each position of the enhanced low-level and high-level feature maps through a broadcast element-wise addition, so that each position obtains global context information, yielding the low-level and high-level feature maps fused with global context information.
As a further limitation, the specific steps of determining the density map according to the low-level and high-level feature maps fused with global context information are as follows:
performing an upsampling operation on the high-level feature map fused with global context information;
and concatenating (channel splicing) the upsampled high-level feature map with the low-level feature map fused with global context information, followed by a convolution operation, to obtain the density map.
As a further limitation, the population counting according to the density map specifically includes: the predicted population is obtained by integrating and summing the density maps.
A population counting system based on a global context convolutional network, comprising:
the image acquisition module is used for acquiring a crowd image to be counted;
the feature extraction module is used for extracting a low-level feature map and a high-level feature map of the crowd image;
the scale perception module is used for performing scale perception on the low-level and high-level feature maps to obtain an enhanced low-level feature map and an enhanced high-level feature map;
the global context module is used for sequentially carrying out context modeling and feature transformation on the enhanced low-level and high-level feature maps, extracting global context features, and obtaining, through feature fusion, the low-level and high-level feature maps fused with global context information;
the density map determining module is used for determining a density map according to the low-level and high-level feature maps fused with global context information;
and the people counting module is used for counting people according to the density map.
A computer readable storage medium having stored therein a plurality of instructions adapted to be loaded by a processor of a terminal device and to execute a global context convolutional network based crowd counting method.
A terminal device comprising a processor and a computer readable storage medium, the processor being configured to implement instructions; the computer readable storage medium is used for storing a plurality of instructions, the instructions being adapted to be loaded by the processor to execute the crowd counting method based on the global context convolutional network.
Compared with the prior art, the invention has the beneficial effects that:
compared with the standard convolution, the method has the advantages that the hole convolution is used, so that the method has a larger receptive field when the feature graph is subjected to convolution operation, contains more local context information and reduces the calculation complexity.
The invention uses the hole convolution with different expansion rates to form a proportional pyramid type network, and compared with the traditional convolution operation with convolution kernels of different sizes, the invention has simpler structure and smaller complexity while extracting the multi-proportion information of the characteristic diagram.
The invention extracts the global context characteristics of the low-level characteristic diagram and the high-level characteristic diagram, obtains the low-level characteristic diagram and the high-level characteristic diagram which are blended with the global context information through characteristic fusion, and captures the dependency relationship between channels, thereby ensuring that each position of the image can obtain the global context information, obtaining the remote dependency relationship between pixels and ensuring that the characteristic diagram contains richer information.
The global context module used by the invention belongs to a lightweight computing module, so that the model has less resource consumption and higher computing efficiency.
Drawings
The accompanying drawings, which constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and together with the description serve to explain the invention rather than limit it.
FIG. 1 is an overall flow diagram of the present invention;
FIG. 2 is a schematic diagram of the present invention;
FIG. 3 is a scale-aware schematic of the present invention;
FIG. 4 is a diagram illustrating the global context information extraction and fusion principle of the present invention.
Detailed Description
the invention is further described with reference to the following figures and examples.
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the invention. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise; it should further be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
Example 1
In one or more embodiments, a crowd counting method based on a global context convolutional network is provided. The method uses a scale perception module and a global context module to process the extracted low-level features and high-level features respectively, so as to capture rich scale information and context information and ultimately predict the density map more accurately.
As shown in FIG. 1 and FIG. 2, a crowd counting method based on a global context convolutional network includes the following specific steps:
step 1, acquiring a crowd image to be counted;
step 2, extracting a low-level feature map and a high-level feature map of the crowd image, i.e., extracting image features: feature maps carrying low-level features and high-level features are extracted from the image to be counted, respectively;
the first five layers (convolutional blocks) of the VGG-16 network are adopted to extract the low-level and high-level feature maps of the crowd image: the feature map output by the third layer serves as the low-level feature map, and the feature maps output by the fourth and fifth layers serve as the high-level feature maps; given an image I, the output feature map of VGG-16 can be expressed as:
f_v = F_vgg(I)
Through the VGG-16 backbone network, the feature information of the image is preliminarily extracted.
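To make this concrete, here is a minimal PyTorch sketch of such a VGG-16 front end. The mapping of the "first five layers" onto slices of torchvision's vgg16.features, and the use of ImageNet pre-trained weights, are assumptions rather than details given in the patent.

```python
import torch
import torchvision

class VGGFrontEnd(torch.nn.Module):
    """First five VGG-16 blocks: conv3_3 -> low-level, conv4_3/conv5_3 -> high-level."""
    def __init__(self):
        super().__init__()
        features = torchvision.models.vgg16(weights="IMAGENET1K_V1").features
        self.block1_3 = features[:16]   # conv1_1 ... conv3_3 (+ReLU), two max-pools
        self.block4 = features[16:23]   # pool3, conv4_1 ... conv4_3 (+ReLU)
        self.block5 = features[23:30]   # pool4, conv5_1 ... conv5_3 (+ReLU)

    def forward(self, img):
        low = self.block1_3(img)        # low-level feature map, 256 ch, 1/4 resolution
        high4 = self.block4(low)        # high-level feature map, 512 ch, 1/8 resolution
        high5 = self.block5(high4)      # high-level feature map, 512 ch, 1/16 resolution
        return low, high4, high5

# usage: low, high4, high5 = VGGFrontEnd()(torch.randn(1, 3, 512, 512))
```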
And step 3, performing scale perception on the low-level and high-level feature maps respectively and extracting their multi-scale information to obtain an enhanced low-level feature map and an enhanced high-level feature map; that is, multi-scale features of the low-level and high-level feature maps are extracted to obtain feature maps carrying multi-scale information. As shown in FIG. 3, specifically:
firstly, compressing channels for a low-level feature map and a high-level feature map through four 1 × 1 convolution operations to obtain compressed feature maps;
then, the compressed low-level and high-level feature maps are convolved by four dilated convolutions with different dilation rates (d in FIG. 3 denotes the dilation rate, which is 1, 2, 3, and 4, respectively) to extract multi-scale feature maps;
and finally, splicing the extracted multi-scale feature maps according to a channel splicing method to obtain an enhanced low-level feature map and an enhanced high-level feature map.
Through scale perception, multi-scale information of the image is extracted and the extracted low-level and high-level feature maps are enhanced.
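A minimal PyTorch sketch of such a scale perception module follows. The 1 × 1 channel compression, the four dilation rates 1-4, and the channel concatenation follow the text; the branch width (one quarter of the input channels) and the ReLU placements are assumptions.

```python
import torch
import torch.nn as nn

class ScaleAwareModule(nn.Module):
    """Four parallel branches: 1x1 compression then 3x3 dilated conv, d in {1,2,3,4}."""
    def __init__(self, in_ch):
        super().__init__()
        branch_ch = in_ch // 4          # assumed compressed width per branch
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, branch_ch, kernel_size=1),           # compress channels
                nn.ReLU(inplace=True),
                nn.Conv2d(branch_ch, branch_ch, kernel_size=3,
                          padding=d, dilation=d),                     # dilated conv, rate d
                nn.ReLU(inplace=True),
            )
            for d in (1, 2, 3, 4)
        ])

    def forward(self, x):
        # channel-wise concatenation of the four multi-scale branches
        return torch.cat([branch(x) for branch in self.branches], dim=1)
```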
And step 4, extracting the global context information of the enhanced low-level and high-level feature maps to obtain low-level and high-level feature maps fused with global context information. That is, global context information is aggregated: global context features are extracted from the feature maps carrying multi-scale information and aggregated to each pixel by capturing spatial and channel information, so that the global context is modeled more effectively and feature maps carrying context information are obtained. As shown in FIG. 4, this specifically includes three parts: (1) context modeling, (2) feature transformation, and (3) feature fusion.
(1) Context modeling: first, the feature map X is convolved with W_k, a linear transformation matrix implemented as a 1 × 1 convolution; a softmax operation then normalizes the attention weights. Meanwhile, a reshape operation is applied to the feature map X, which is matrix-multiplied (the ⊗ in FIG. 4) with the normalized attention weights to obtain the global context feature. Here, the feature map X is an enhanced low-level or high-level feature map.
(2) Feature transformation: the global context feature is convolved with W_v1, which, as shown in FIG. 4, is a linear transformation matrix implemented as a 1 × 1 convolution; LayerNorm and ReLU operations then follow in sequence, which improves performance and eases network optimization; finally, a 1 × 1 convolution (W_v2) completes the feature transformation. This bottleneck reduces the parameter count from C·C to 2·C·C/r, where r is the dimension reduction ratio, C/r is the hidden representation dimension, and r is generally set to 16. The feature transformation captures the dependencies between channels, yielding the importance of each channel, and is expressed as:

δ(·) = W_v2 ReLU(LN(W_v1(·)))
(3) Feature fusion: after context modeling and feature transformation, the global context features are aggregated to each position of the original feature map X through a broadcast element-wise addition, so that each position i of the original feature map obtains global context information. The low-level and high-level feature maps fused with global context information can be expressed as:

z_i = x_i + δ( Σ_{j=1}^{N_p} α_j x_j ),  with  α_j = exp(W_k x_j) / Σ_{m=1}^{N_p} exp(W_k x_m)

where the input enhanced low-level or high-level feature map is X ∈ R^{C×W×H}, C is the number of channels, N_p = W × H is the number of positions in the feature map X, and α_j is the global attention weight.
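The three parts above can be sketched in PyTorch as follows, consistent with the formula z_i = x_i + δ(Σ_j α_j x_j) and with r = 16; the exact tensor bookkeeping is an assumption.

```python
import torch
import torch.nn as nn

class GlobalContextBlock(nn.Module):
    """Context modeling, feature transformation, and broadcast element-wise fusion."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.w_k = nn.Conv2d(channels, 1, kernel_size=1)        # attention logits W_k
        self.transform = nn.Sequential(                         # delta = Wv2 ReLU(LN(Wv1 .))
            nn.Conv2d(channels, channels // r, kernel_size=1),  # W_v1, bottleneck C -> C/r
            nn.LayerNorm([channels // r, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, kernel_size=1),  # W_v2, back to C
        )

    def forward(self, x):
        n, c, h, w = x.shape
        # (1) context modeling: softmax over all N_p = H*W positions -> alpha_j
        attn = torch.softmax(self.w_k(x).view(n, 1, h * w), dim=2)
        feat = x.view(n, c, h * w)                              # reshape X
        # sum_j alpha_j x_j -> one C-dimensional global context vector per image
        context = torch.matmul(feat, attn.transpose(1, 2)).view(n, c, 1, 1)
        # (2) feature transformation, (3) broadcast element-wise addition
        return x + self.transform(context)
```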
And step 5, performing an upsampling operation on the high-level feature maps fused with global context information so that they have the same size as the low-level feature map fused with global context information. That is, the context-information feature maps derived from the high-level features are upsampled to the same size as the one derived from the low-level features; specifically:
the high-level feature map fused with the global context feature corresponding to the feature map of the fourth layer of VGG-16Net is subjected to up-sampling multiplied by 2 operation, and the high-level feature map fused with the global context feature corresponding to the feature map of the fifth layer is subjected to up-sampling multiplied by 4 operation, so that the obtained feature map and the low-level feature map fused with the global context feature corresponding to the feature map of the third layer can be the same in size.
And step 6, the three feature maps are fused together by channel concatenation; that is, the upsampled high-level feature maps and the low-level feature map fused with global context information are concatenated along the channel dimension, and the predicted density map is obtained through a 1 × 1 convolution.
And step 7, finally, the density map is integrated (summed) to obtain the predicted number of people; that is, the predicted crowd count of the image is obtained by integrating the predicted crowd density map.
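A minimal PyTorch sketch of steps 5 to 7 follows. The ×2/×4 upsampling, channel concatenation, 1 × 1 convolution, and summation follow the text; the use of bilinear interpolation and the channel widths (256/512/512, matching VGG-16's third to fifth blocks) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DensityHead(nn.Module):
    """Upsample high-level maps, concatenate with the low-level map, predict density."""
    def __init__(self, low_ch=256, high4_ch=512, high5_ch=512):
        super().__init__()
        self.to_density = nn.Conv2d(low_ch + high4_ch + high5_ch, 1, kernel_size=1)

    def forward(self, low, high4, high5):
        high4 = F.interpolate(high4, scale_factor=2, mode="bilinear", align_corners=False)
        high5 = F.interpolate(high5, scale_factor=4, mode="bilinear", align_corners=False)
        fused = torch.cat([low, high4, high5], dim=1)   # channel splicing
        return self.to_density(fused)                   # predicted density map

def crowd_count(density_map: torch.Tensor) -> torch.Tensor:
    # integrating (summing) the density map yields the predicted head count per image
    return density_map.sum(dim=(1, 2, 3))
```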
Example 2
The embodiment provides a crowd counting system based on a global context convolutional network, which comprises:
the image acquisition module is used for acquiring a crowd image to be counted;
the feature extraction module is used for extracting a low-level feature map and a high-level feature map of the crowd image; the feature extraction module adopts the first five layers (convolutional blocks) of the VGG-16 network, taking the feature map output by the third layer as the low-level feature map and the feature maps output by the fourth and fifth layers as the high-level feature maps. Given an image I, the output feature map of VGG-16 can be expressed as:
f_v = F_vgg(I)
the feature information of the image is preliminarily extracted through the VGG-16 backbone network;
the scale perception module is used for performing scale perception on the low-level and high-level feature maps to obtain an enhanced low-level feature map and an enhanced high-level feature map; the scale perception module extracts multi-scale information of the image and enhances the extracted low-level and high-level feature maps;
the global context module is used for sequentially carrying out context modeling and feature transformation on the enhanced low-level and high-level feature maps, extracting global context features, and obtaining, through feature fusion, the low-level and high-level feature maps fused with global context information;
the global context module may be represented as:
Figure BDA0003013560510000101
input characteristic diagram X ∈ R C×W×H C represents the number of channels, order
Figure BDA0003013560510000111
N P For the number of positions of the feature map, i.e. WXH, use
Figure BDA0003013560510000112
Weight representing global attention, δ (·) W v2 RuLU(LN(W v1 (·))) represents a feature transformation;
the density map determining module is used for determining a density map according to the low-level and high-level feature maps fused with global context information;
and the people counting module is used for counting people according to the density map.
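As an illustration of how these modules might be wired together, the sketch below composes the hypothetical classes from the sketches in Example 1 (VGGFrontEnd, ScaleAwareModule, GlobalContextBlock, DensityHead); it shows the data flow of the system, not the patent's actual implementation.

```python
import torch
import torch.nn as nn

class GlobalContextCrowdCounter(nn.Module):
    """Backbone -> scale perception -> global context -> density map, per the modules above."""
    def __init__(self):
        super().__init__()
        self.backbone = VGGFrontEnd()                             # feature extraction module
        self.scale = nn.ModuleList(
            [ScaleAwareModule(c) for c in (256, 512, 512)])       # scale perception module
        self.context = nn.ModuleList(
            [GlobalContextBlock(c) for c in (256, 512, 512)])     # global context module
        self.head = DensityHead(256, 512, 512)                    # density map determining module

    def forward(self, img):
        maps = list(self.backbone(img))                           # low, high4, high5
        maps = [s(m) for s, m in zip(self.scale, maps)]
        maps = [g(m) for g, m in zip(self.context, maps)]
        density = self.head(*maps)
        return density, density.sum(dim=(1, 2, 3))                # density map and crowd count
```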
Example 3
The present embodiment provides a computer-readable storage medium, in which a plurality of instructions are stored, the instructions being adapted to be loaded by a processor of a terminal device to execute the crowd counting method based on the global context convolutional network.
Example 4
A terminal device comprising a processor and a computer readable storage medium, the processor being configured to implement instructions; the computer readable storage medium is used for storing a plurality of instructions, the instructions being adapted to be loaded by the processor to execute the crowd counting method based on the global context convolutional network.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, they are not intended to limit the scope of the present invention; those skilled in the art should understand that various modifications and variations can be made on the basis of the technical solution of the present invention without inventive effort, and such modifications and variations remain within the protection scope of the present invention.

Claims (8)

1. A crowd counting method based on a global context convolutional network is characterized in that: the method comprises the following steps:
acquiring a crowd image to be counted;
extracting a low-level feature map and a high-level feature map of the crowd image;
carrying out scale perception on the low-level feature map and the high-level feature map to obtain an enhanced low-level feature map and an enhanced high-level feature map;
sequentially carrying out context modeling and feature transformation on the enhanced low-level feature map and the enhanced high-level feature map, extracting global context features, and obtaining, through feature fusion, the low-level and high-level feature maps fused with global context information; the feature fusion aggregates the global context features to each position of the enhanced low-level and high-level feature maps through a broadcast element-wise addition, so that each position obtains global context information, yielding the low-level and high-level feature maps fused with global context information;
determining a density map according to the low-level and high-level feature maps fused with global context information; the specific steps are: performing an upsampling operation on the high-level feature map fused with global context information; concatenating the upsampled high-level feature map with the low-level feature map fused with global context information and applying a convolution operation to obtain the density map;
carrying out crowd counting according to the density map.
2. The crowd counting method based on the global context convolutional network as claimed in claim 1, wherein: the specific steps of performing scale perception on the low-level feature map and the high-level feature map to obtain the enhanced low-level feature map and the enhanced high-level feature map comprise:
compressing channels by four convolution operations on the low-level feature map and the high-level feature map to obtain compressed feature maps;
extracting multi-scale feature maps from the compressed low-level and high-level feature maps through four dilated convolutions with different dilation rates;
and splicing the extracted multi-scale feature maps according to a channel splicing method to obtain an enhanced low-level feature map and an enhanced high-level feature map.
3. The crowd counting method based on the global context convolutional network as claimed in claim 1, wherein: the specific steps of the context modeling are as follows:
convolving the feature map with the linear transformation matrix, and normalizing the attention weights through a softmax function to obtain normalized attention weights;
and applying a reshape operation to the feature map, then matrix-multiplying it with the normalized attention weights to obtain the initial global context feature.
4. The crowd counting method based on the global context convolutional network as claimed in claim 3, wherein: the specific steps of the feature transformation include:
firstly, the initial global context feature is convolved with a linear transformation matrix; LayerNorm and ReLU operations are then performed in sequence; finally, the feature transformation is completed through a 1 × 1 convolution to obtain the global context feature.
5. The crowd counting method based on the global context convolutional network as claimed in claim 1, wherein: the crowd counting according to the density map specifically comprises: the predicted number of people is obtained by integrating and summing the density map.
6. A crowd counting system based on a global context convolutional network is characterized in that: the method comprises the following steps:
the image acquisition module is used for acquiring a crowd image to be counted;
the feature extraction module is used for extracting a low-level feature map and a high-level feature map of the crowd image;
the scale perception module is used for performing scale perception on the low-level feature map and the high-level feature map to obtain an enhanced low-level feature map and an enhanced high-level feature map;
the global context module is used for sequentially carrying out context modeling and feature transformation on the enhanced low-level feature map and the enhanced high-level feature map, extracting global context features, and obtaining, through feature fusion, the low-level and high-level feature maps fused with global context information; the feature fusion aggregates the global context features to each position of the enhanced low-level and high-level feature maps through a broadcast element-wise addition, so that each position obtains global context information, yielding the low-level and high-level feature maps fused with global context information;
the density map determining module is used for determining a density map according to the low-level and high-level feature maps fused with global context information; the specific steps are: performing an upsampling operation on the high-level feature map fused with global context information; concatenating the upsampled high-level feature map with the low-level feature map fused with global context information and applying a convolution operation to obtain the density map;
and the people counting module is used for counting people according to the density map.
7. A computer-readable storage medium, characterized in that: a plurality of instructions are stored therein, the instructions being adapted to be loaded by a processor of a terminal device to perform the crowd counting method based on the global context convolutional network as claimed in any one of claims 1 to 5.
8. A terminal device, characterized in that: it comprises a processor and a computer readable storage medium, the processor being configured to implement instructions; the computer readable storage medium stores a plurality of instructions, the instructions being adapted to be loaded by the processor to perform the crowd counting method based on the global context convolutional network as claimed in any one of claims 1 to 5.
CN202110382645.6A (priority date 2021-04-09, filing date 2021-04-09) — Crowd counting method and system based on global context convolutional network — Active — CN113192009B (en)

Priority Applications (1)

CN202110382645.6A — priority/filing date 2021-04-09 — Crowd counting method and system based on global context convolutional network (CN113192009B)

Applications Claiming Priority (1)

CN202110382645.6A — priority/filing date 2021-04-09 — Crowd counting method and system based on global context convolutional network (CN113192009B)

Publications (2)

Publication Number — Publication Date
CN113192009A (en) — 2021-07-30
CN113192009B (en) — 2022-09-02

Family ID: 76975232

Family Applications (1)

CN202110382645.6A (Active) — Crowd counting method and system based on global context convolutional network

Country Status (1)

CN: CN113192009B (en)

Citations (1)

* Cited by examiner, † Cited by third party

CN109271960A * — priority 2018-10-08, published 2019-01-25 — 燕山大学 (Yanshan University) — Crowd counting method based on convolutional neural networks

Family Cites Families (5)

US9946952B2 * — priority 2013-06-25, published 2018-04-17 — University of Central Florida Research Foundation, Inc. — Multi-source, multi-scale counting in dense crowd images
CN111626237A * — priority 2020-05-29, published 2020-09-04 — 中国民航大学 (Civil Aviation University of China) — Crowd counting method and system based on enhanced multi-scale perception network
CN112132023B * — priority 2020-09-22, published 2024-05-17 — 上海应用技术大学 (Shanghai Institute of Technology) — Crowd counting method based on multi-scale context enhancement network
CN112541459A * — priority 2020-12-21, published 2021-03-23 — 山东师范大学 (Shandong Normal University) — Crowd counting method and system based on multi-scale perception attention network
CN112580545B * — priority 2020-12-24, published 2022-07-29 — 山东师范大学 (Shandong Normal University) — Crowd counting method and system based on multi-scale adaptive context network


Non-Patent Citations (3)

Liu W et al., "Context-aware crowd counting," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2019, pp. 5099-5108. *
Wang L et al., "Multi-level feature fusion network for crowd counting," IET Computer Vision, Feb. 2021, pp. 60-72. *
左静 (Zuo Jing), "一种多尺度融合的深度人群计数算法" [A deep crowd counting algorithm with multi-scale fusion], 激光与光电子学进展 (Laser & Optoelectronics Progress), June 2020, pp. 1-13. *

Also Published As

CN113192009A (en) — published 2021-07-30

Similar Documents

Publication — Title
CN109271933B (en) Method for estimating three-dimensional human body posture based on video stream
CN111784602B (en) Method for generating countermeasure network for image restoration
CN111445418B (en) Image defogging processing method and device and computer equipment
CN108510451B (en) Method for reconstructing license plate based on double-layer convolutional neural network
CN109919032B (en) Video abnormal behavior detection method based on motion prediction
CN108288270B (en) Target detection method based on channel pruning and full convolution deep learning
CN109509149A (en) A kind of super resolution ratio reconstruction method based on binary channels convolutional network Fusion Features
CN109993269B (en) Single image crowd counting method based on attention mechanism
CN111461083A (en) Rapid vehicle detection method based on deep learning
CN110659664B (en) SSD-based high-precision small object identification method
CN111062395B (en) Real-time video semantic segmentation method
CN112541459A (en) Crowd counting method and system based on multi-scale perception attention network
CN112766123B (en) Crowd counting method and system based on criss-cross attention network
CN113066089B (en) Real-time image semantic segmentation method based on attention guide mechanism
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN112308087A (en) Integrated imaging identification system and method based on dynamic vision sensor
CN114519819B (en) Remote sensing image target detection method based on global context awareness
CN116485646A (en) Micro-attention-based light-weight image super-resolution reconstruction method and device
CN111126185A (en) Deep learning vehicle target identification method for road intersection scene
CN113192084A (en) Machine vision-based highway slope micro-displacement deformation monitoring method
CN116778346B (en) Pipeline identification method and system based on improved self-attention mechanism
CN111951260B (en) Partial feature fusion based convolutional neural network real-time target counting system and method
CN113192009B (en) Crowd counting method and system based on global context convolutional network
CN116740547A (en) Digital twinning-based substation target detection method, system, equipment and medium
CN115170803A (en) E-SOLO-based city street view example segmentation method

Legal Events

Code — Description
PB01 — Publication
SE01 — Entry into force of request for substantive examination
GR01 — Patent grant