CN116645516A - Multi-category target counting method and system based on multi-perception feature fusion - Google Patents


Info

Publication number
CN116645516A
Authority
CN
China
Prior art keywords
feature
features
convolution
channel
map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310513969.8A
Other languages
Chinese (zh)
Inventor
张莉
魏祥一
赵雷
王邦军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University
Priority to CN202310513969.8A
Publication of CN116645516A
Legal status: Pending

Classifications

    • G06V10/40: Extraction of image or video features
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G06N3/08: Learning methods
    • G06V10/52: Scale-space analysis, e.g. wavelet analysis
    • G06V10/806: Fusion of extracted features (combining data at the feature extraction level)
    • G06V10/82: Image or video recognition or understanding using neural networks


Abstract

The application relates to a multi-category target counting method and system based on multi-perception feature fusion, wherein the method comprises the following steps: step S1: acquiring an image containing multiple categories of targets, and extracting a feature map of the image; step S2: extracting multi-scale features and spatial features from the feature map, and extracting channel features from the multi-scale features; step S3: fusing the spatial features and the channel features to obtain fused features; step S4: inputting the fused features into a counting network, and outputting a density map through the counting network; step S5: integrating the density map to obtain the number of targets of each category. The application can count vehicles and pedestrians effectively and achieves good results.

Description

Multi-category target counting method and system based on multi-perception feature fusion
Technical Field
The application relates to the technical field of multi-target counting, in particular to a multi-category target counting method and system based on multi-perception feature fusion.
Background
In recent years, more and more researchers have focused on the important role of automatic crowd counting in public monitoring and intelligent transportation systems. Analysis of crowd behavior and traffic density has a significant impact on the efficiency of public transportation. In the construction of smart cities, whether an intelligent transportation system can accurately and efficiently obtain crowd behavior information and vehicle information from public surveillance is a very important link in realizing effective traffic planning, and crowd counting and vehicle counting are the basic tasks of crowd behavior analysis and traffic flow analysis. Crowd counting, for example, has already been widely used in public safety, video surveillance and other fields. There are generally two approaches to the target counting problem: counting the objects directly, where the input is an image and the output is the total number of people in the image; and predicting a density map, where the input is an image, the output is a predicted crowd density map, and the density map is then integrated to obtain the crowd count.
In earlier years, researchers studied the target counting problem with either detection-based or regression-based methods. Detection-based methods treat the whole object in the image as a detectable entity, extract hand-crafted features of the object using image processing techniques, and then simply count the detection results; however, this assumption does not always hold in practice, especially when objects are very dense and severely occluded. Regression-based methods use global features such as texture and gradients, and then apply machine learning methods, such as support vector machines or ridge regression, to learn a regression model that maps the hand-crafted features to the number of people in the image. Detection-based methods suffer from severe occlusion, perspective distortion and similar problems, and their performance on high-density images is clearly degraded; regression-based methods use global features for counting, which alleviates the occlusion and perspective problems, but they still perform poorly on high-density target images.
With the rapid development of deep learning, more and more computer vision researchers have turned to convolutional neural networks. Convolutional neural networks (CNNs) have a strong capability for automatic feature extraction, and recent studies on object counting show that CNN-based methods outperform all traditional methods. While traditional object counting methods typically output only the number of people in an image, CNN-based methods typically predict density maps, which contain density information that traditional count-only predictions cannot provide. However, in the field of intelligent transportation there is still a lack of counting datasets containing common vehicles and of methods for counting common vehicles and pedestrians together.
Disclosure of Invention
The technical problem to be solved by the application is to overcome the lack of a method for counting vehicles and pedestrians in the prior art.
In order to solve the technical problems, the application provides a multi-category target counting method based on multi-perception feature fusion, which comprises the following steps:
step S1: acquiring an image containing multiple categories of targets, and extracting a feature map of the image;
step S2: extracting multi-scale features and spatial features from the feature map, and extracting channel features from the multi-scale features;
step S3: fusing the spatial features and the channel features to obtain fused features;
step S4: inputting the fused features into a counting network, and outputting a density map through the counting network;
step S5: integrating the density map to obtain the number of targets of each category.
In one embodiment of the present application, the step S1 extracts a feature map of the image through a convolutional neural network;
the neural network comprises a first convolution unit, a second convolution unit, a third convolution unit and a fourth convolution unit which are sequentially connected;
wherein a pooling layer is arranged between the first convolution unit and the second convolution unit, between the second convolution unit and the third convolution unit and between the third convolution unit and the fourth convolution unit;
the first convolution unit comprises two convolution layers with 64 channels, the second convolution unit comprises two convolution layers with 128 channels, the third convolution unit comprises three convolution layers with 256 channels, and the fourth convolution unit comprises three convolution layers with 512 channels.
In one embodiment of the present application, the step S2 extracts multi-scale features of the feature map, and the method includes:
inputting the feature map F_0 into a multi-scale feature extraction network, extracting four features of different scales from the feature map F_0 through the multi-scale feature extraction network, and splicing the four features of different scales to obtain a multi-scale feature F;
wherein the multi-scale feature extraction network comprises four parallel convolution units;
the first convolution unit comprises a first expanded convolution layer, and an average pooling layer is arranged in front of the first expanded convolution layer;
the second convolution unit comprises a second expansion convolution layer, a third expansion convolution layer and a fourth expansion convolution layer which are sequentially connected;
the third convolution unit comprises a fifth expansion convolution layer and a sixth expansion convolution layer which are sequentially connected;
the fourth convolution unit includes a seventh expanded convolution layer.
In one embodiment of the present application, the method of extracting the channel features of the multi-scale feature in the step S2 includes:
extracting initial channel features C(F)_initial of the multi-scale feature through a channel feature extraction network;
multiplying the initial channel features C(F)_initial element-wise with the multi-scale feature to obtain the channel features C(F);
wherein the initial channel features C(F)_initial of the multi-scale feature are extracted by the channel feature extraction network according to the formula:
C(F)_initial = concat(σ(L(r(L(v, W_0), W_1)))(F))
wherein F denotes the multi-scale feature, concat denotes matrix concatenation, σ denotes the sigmoid function, v denotes the feature vector obtained through the average pooling layer, W_0 and W_1 are the parameters of the two linear layers, L is a linear layer and r is the ReLU activation function.
In one embodiment of the present application, the step S2 extracts spatial features of the feature map, and the method includes:
extracting initial spatial features S(F_0)_initial of the feature map through a spatial feature extraction network;
multiplying the initial spatial features S(F_0)_initial element-wise with the feature map to obtain the spatial features S(F_0);
wherein the initial spatial features S(F_0)_initial of the feature map are extracted by the spatial feature extraction network according to the formula:
S(F_0)_initial = σ(Conv(MAP(F_0), θ)) ⊙ F_0
wherein F_0 denotes the feature map, σ denotes the sigmoid function, Conv denotes a 7×7 convolution layer, MAP denotes maximum pooling and average pooling, θ denotes the parameters of the spatial feature extraction network, and ⊙ denotes element-wise multiplication.
In one embodiment of the present application, in the step S3, the spatial feature and the channel feature are fused to obtain a fused feature, where the formula is:
Output_MA = C(F) + S(F_0)
wherein C(F) denotes the channel features and S(F_0) denotes the spatial features.
In one embodiment of the present application, the counting network in the step S4 includes five convolution layers connected in sequence, wherein the third convolution layer is a deconvolution layer; the fifth convolution layer has a convolution kernel size of 1 x 1 for outputting the density map.
In order to solve the technical problems, the application provides a multi-category target counting system based on multi-perception feature fusion, which comprises:
a first extraction module: used for acquiring an image containing multiple categories of targets and extracting a feature map of the image;
a second extraction module: used for extracting multi-scale features and spatial features from the feature map, and extracting channel features from the multi-scale features;
a fusion module: used for fusing the spatial features and the channel features to obtain fused features;
a density map construction module: used for inputting the fused features into a counting network and outputting a density map through the counting network;
a quantity prediction module: used for integrating the density map to obtain the number of targets of each category.
In order to solve the technical problems, the application provides electronic equipment, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the steps of the multi-category target counting method based on multi-perception feature fusion when executing the computer program.
To solve the above technical problem, the present application provides a computer readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the multi-class target counting method based on multi-perception feature fusion.
Compared with the prior art, the technical scheme of the application has the following advantages:
the application realizes effective extraction and fusion of features by constructing a convolutional neural network, a multi-scale feature extraction network, a channel feature extraction network and a spatial feature extraction network, and finally realizes effective counting of common vehicles and pedestrians;
the application can provide important support for the public monitoring and intelligent transportation fields.
Drawings
In order that the application may be more readily understood, a more particular description of the application will be rendered by reference to specific embodiments thereof that are illustrated in the appended drawings.
FIG. 1 is a flow chart of the method of the present application;
FIG. 2 is a schematic diagram of a convolutional neural network in an embodiment of the present application;
FIG. 3 is a schematic diagram of a multi-scale feature extraction network in accordance with an embodiment of the application;
FIG. 4 is a schematic diagram of a channel feature extraction network according to an embodiment of the present application;
fig. 5 is a schematic diagram of a spatial feature extraction network according to an embodiment of the present application.
Detailed Description
The present application will be further described with reference to the accompanying drawings and specific examples, which are not intended to be limiting, so that those skilled in the art will better understand the application and practice it.
Example 1
Referring to fig. 1, the application relates to a multi-category target counting method based on multi-perception feature fusion, which comprises the following steps:
step S1: acquiring an image containing multiple categories of targets, and extracting a feature map of the image;
step S2: extracting multi-scale features and spatial features from the feature map, and extracting channel features from the multi-scale features;
step S3: fusing the spatial features and the channel features to obtain fused features;
step S4: inputting the fused features into a counting network, and outputting a density map through the counting network;
step S5: integrating the density map to obtain the number of targets of each category.
The present embodiment is described in detail below:
In this embodiment, a large-scale multi-category target counting dataset is collected and annotated. The dataset is divided into 8 categories and contains 2521 images with 274199 annotation points in total, which makes it suitable for verifying the performance of the present application. The specific implementation steps are as follows:
in the research, the problem of unbalanced categories is generally found to be needed to be considered in multi-category counting, so that a category self-adaptive weight distribution loss function is designed in the embodiment, so that model loss is reduced in the model training process, and model accuracy is improved. The class adaptive weight allocation loss function formula is as follows:
wherein m is the class number, n is the number of test samples, D ij Andrespectively, are image X i True and estimated density map with j-th class, m p Is a model parameter of the entire network. />And T ij Respectively representing predicted values X of class j i And actual truth count, γ is the weight of a conditional reflection difficulty sample on total loss. In the experiment, this example was empirically set to 0.01 during training.
Referring to fig. 2, step S1 extracts the feature map of the image through a convolutional neural network. The convolutional neural network of this embodiment shows excellent local feature extraction capability because it stacks many layers with small convolution kernels. The neural network comprises a first convolution unit, a second convolution unit, a third convolution unit and a fourth convolution unit connected in sequence; a pooling layer is arranged between the first and second convolution units, between the second and third convolution units, and between the third and fourth convolution units. The first convolution unit comprises two convolution layers with 64 channels, the second convolution unit comprises two convolution layers with 128 channels, the third convolution unit comprises three convolution layers with 256 channels, and the fourth convolution unit comprises three convolution layers with 512 channels. The feature map output by the convolutional neural network for an input image is 1/8 of the original input size.
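For illustration only, a minimal PyTorch sketch of a front end consistent with the above description is given below; the 3×3 kernels, padding of 1 and ReLU activations are assumptions, since the text only specifies the number of convolution layers and channels per unit and the placement of the pooling layers.

```python
import torch.nn as nn

class BackboneCNN(nn.Module):
    """Front-end feature extractor of step S1: four convolution units (2x64, 2x128,
    3x256, 3x512) with a pooling layer between consecutive units, so the output
    feature map is 1/8 of the input size."""
    def __init__(self, in_channels=3):
        super().__init__()
        cfg = [(2, 64), (2, 128), (3, 256), (3, 512)]   # (number of convs, channels) per unit
        layers, prev = [], in_channels
        for i, (n, c) in enumerate(cfg):
            for _ in range(n):
                layers += [nn.Conv2d(prev, c, kernel_size=3, padding=1),
                           nn.ReLU(inplace=True)]
                prev = c
            if i < len(cfg) - 1:                        # pooling only between units
                layers.append(nn.MaxPool2d(2))
        self.features = nn.Sequential(*layers)

    def forward(self, x):
        return self.features(x)                         # shape (B, 512, H/8, W/8)
```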
Further, in order to address the large differences in the size and shape of targets in the multi-category target counting task, this embodiment constructs a multi-scale feature extraction network, which consists of four groups of convolution branches with different receptive fields and dilation rates; this embodiment uses dilated convolution (i.e., atrous convolution) for multi-scale feature perception. Dilated convolution can enlarge the receptive field while maintaining spatial resolution. Unlike the multi-column convolution structures typically used in counting tasks, this embodiment also places an average pooling layer in one of the convolution branches to improve the performance of the model.
Specifically, in step S2, the multi-scale features of the feature map are extracted as follows: the feature map is input into the multi-scale feature extraction network, four features of different scales are extracted from the feature map through the multi-scale feature extraction network, and the four features of different scales are spliced to obtain the multi-scale feature. Referring to fig. 3, the multi-scale feature extraction network comprises four parallel convolution units. The first convolution unit comprises a first expansion convolution layer (K=1, D=1, P=1), and an average pooling layer is arranged before the first expansion convolution layer; the second convolution unit comprises a second expansion convolution layer (K=1, D=1, P=1), a third expansion convolution layer (K=3, D=3, P=3) and a fourth expansion convolution layer (K=3, D=3, P=3) connected in sequence; the third convolution unit comprises a fifth expansion convolution layer (K=1, D=1, P=1) and a sixth expansion convolution layer (K=3, D=2, P=2) connected in sequence; the fourth convolution unit comprises a seventh expansion convolution layer (K=1, D=1, P=1).
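A minimal PyTorch sketch of the four parallel branches described above follows. It assumes that the 1×1 convolutions use padding 0 and that the average pooling layer is a stride-1 3×3 pool, so that all branches keep the input resolution and can be concatenated; the per-branch output width (one quarter of the input channels) is also an assumption, as the text does not give channel numbers.

```python
import torch
import torch.nn as nn

class MultiScaleExtractor(nn.Module):
    """Multi-scale feature extraction network of Fig. 3: four parallel branches with
    different receptive fields whose outputs are concatenated into the multi-scale feature F."""
    def __init__(self, channels=512):
        super().__init__()
        c = channels // 4
        self.branch1 = nn.Sequential(                    # average pooling + 1x1 convolution
            nn.AvgPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(channels, c, 1), nn.ReLU(inplace=True))
        self.branch2 = nn.Sequential(                    # 1x1 conv + two dilation-3 convs
            nn.Conv2d(channels, c, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=3, dilation=3), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=3, dilation=3), nn.ReLU(inplace=True))
        self.branch3 = nn.Sequential(                    # 1x1 conv + one dilation-2 conv
            nn.Conv2d(channels, c, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=2, dilation=2), nn.ReLU(inplace=True))
        self.branch4 = nn.Sequential(                    # plain 1x1 convolution
            nn.Conv2d(channels, c, 1), nn.ReLU(inplace=True))

    def forward(self, f0):
        branches = (self.branch1, self.branch2, self.branch3, self.branch4)
        return torch.cat([b(f0) for b in branches], dim=1)   # multi-scale feature F
```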
Further, in the multi-category target counting task the feature maps that the model needs to focus on are relatively complex, so this embodiment introduces a channel feature extraction network, which extracts the basic input features by using the feature detector corresponding to each channel of the feature map.
Specifically, in step S2, the channel features of the multi-scale feature are extracted as follows: the initial channel features C(F)_initial of the multi-scale feature are extracted through the channel feature extraction network; the initial channel features C(F)_initial are multiplied element-wise with the multi-scale feature to obtain the channel features C(F); wherein the initial channel features C(F)_initial of the multi-scale feature are extracted by the channel feature extraction network according to the formula:
C(F)_initial = concat(σ(L(r(L(v, W_0), W_1)))(F))
wherein F denotes the multi-scale feature, concat denotes matrix concatenation, σ denotes the sigmoid function, v denotes the feature vector obtained through the average pooling layer, W_0 and W_1 are the parameters of the two linear layers, L is a linear layer and r is the ReLU activation function.
The principle of the formula is as follows: the channel feature extraction network aggregates the spatial information of the multi-scale feature F through an average pooling layer to generate spatial context descriptors, obtains the channel feature vector v, and ranks the importance of the channels through the two linear layers. To limit the complexity of the channel feature extraction network, this embodiment encodes the channel feature vector through a bottleneck formed by the two linear layers around a nonlinearity. The encoded channel feature vector is then normalized to [0, 1] by a sigmoid operation.
Referring to fig. 4, the channel feature extraction network in this embodiment includes an average pooling layer, a first Linear layer (Linear), a ReLU activation function, a second Linear layer (Linear), and a sigmoid function, which are sequentially connected.
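As an illustration of this structure, a minimal PyTorch sketch is given below; the reduction ratio of 16 inside the two-linear-layer bottleneck is an assumption, as the text only states that the two linear layers form a bottleneck around a nonlinearity.

```python
import torch.nn as nn

class ChannelFeatureExtractor(nn.Module):
    """Channel feature extraction network of Fig. 4: average pooling, two linear layers with
    a ReLU in between, a sigmoid, and element-wise re-weighting of the multi-scale feature F."""
    def __init__(self, channels=512, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # aggregates spatial information into v
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # L(v, W_0)
            nn.ReLU(inplace=True),                       # r(.)
            nn.Linear(channels // reduction, channels),  # L(., W_1)
            nn.Sigmoid())                                # sigma, maps the weights to [0, 1]

    def forward(self, f):
        b, c, _, _ = f.shape
        w = self.fc(self.pool(f).view(b, c)).view(b, c, 1, 1)
        return w * f                                     # channel features C(F)
```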
Specifically, in step S2, the spatial features of the feature map are extracted as follows: the initial spatial features S(F_0)_initial of the feature map are extracted through the spatial feature extraction network; the initial spatial features S(F_0)_initial are multiplied element-wise with the feature map to obtain the spatial features S(F_0); wherein the initial spatial features S(F_0)_initial of the feature map are extracted by the spatial feature extraction network according to the formula:
S(F_0)_initial = σ(Conv(MAP(F_0), θ)) ⊙ F_0
wherein F_0 denotes the feature map, σ denotes the sigmoid function, Conv denotes a 7×7 convolution layer, MAP denotes maximum pooling and average pooling, θ denotes the parameters of the spatial feature extraction network, and ⊙ denotes element-wise multiplication.
The principle of the formula is as follows: the role of the spatial feature extraction network is to discover the critical parts of the feature map and weight them spatially. Not all regions of an image contribute equally to the task; only the regions relevant to the task are of interest, and the spatial feature extraction network finds these essential parts. The input to the spatial feature extraction network is the feature map F_0. This embodiment obtains the results of maximum pooling and average pooling, concatenates them into one feature map, learns from it using a convolution layer, and then applies a sigmoid operation.
Referring to fig. 5, the spatial feature extraction network in this embodiment includes a max-pooling and average-pooling layer, a convolution layer with k=7 and p=3, and a sigmoid function connected in sequence.
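A minimal PyTorch sketch of this spatial branch follows; it implements the channel-wise max and average pooling, the 7×7 convolution with padding 3 and the sigmoid re-weighting described above.

```python
import torch
import torch.nn as nn

class SpatialFeatureExtractor(nn.Module):
    """Spatial feature extraction network of Fig. 5: channel-wise max and average pooling,
    concatenation, a 7x7 convolution (padding 3), a sigmoid, and re-weighting of F_0."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, f0):
        max_map, _ = torch.max(f0, dim=1, keepdim=True)      # max pooling over channels
        avg_map = torch.mean(f0, dim=1, keepdim=True)        # average pooling over channels
        attention = torch.sigmoid(self.conv(torch.cat([max_map, avg_map], dim=1)))
        return attention * f0                                # spatial features S(F_0)
```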
Further, in step S3, the spatial features and the channel features are fused to obtain the fused features according to the formula:
Output_MA = C(F) + S(F_0)
wherein C(F) denotes the channel features and S(F_0) denotes the spatial features.
Further, the counting network in step S4 comprises five convolution layers connected in sequence, wherein the parameters of the first, second and fourth convolution layers are (K=3, D=1, P=1) and the third convolution layer is a deconvolution layer (K=9, S=2, P=1); the fifth convolution layer has a convolution kernel size of 1×1 and is used to output the density map.
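For illustration, a minimal PyTorch sketch of the counting head and of the integration in step S5 is given below. The fused feature is taken as the element-wise sum Output_MA = C(F) + S(F_0); the intermediate channel widths (256/128/64) and the choice of one output channel per target class are assumptions not specified in the text.

```python
import torch.nn as nn

class CountingHead(nn.Module):
    """Counting network of step S4: five convolution layers, the first, second and fourth
    with (K=3, D=1, P=1), the third a transposed convolution (K=9, S=2, P=1), and the
    fifth a 1x1 convolution that outputs the density maps."""
    def __init__(self, in_channels=512, num_classes=8):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 128, kernel_size=9, stride=2, padding=1),  # deconvolution
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, num_classes, kernel_size=1))       # one density map per class

    def forward(self, fused):
        return self.layers(fused)                            # shape (B, num_classes, H', W')


def count_per_class(density_maps):
    """Step S5: integrate (sum) each class's density map to obtain that class's object count."""
    return density_maps.sum(dim=(2, 3))                      # shape (B, num_classes)
```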
The experimental comparison results are as follows:
In order to verify the performance of the present application, this embodiment performs experiments on the multi-category target counting dataset collected by the present application using the method provided above, and compares it extensively with existing methods. Measured by mean absolute error (MAE) and mean squared error (MSE), the results in Table 1 show that the present application achieves better performance.
Table 1 Comparison of experimental results

Method                     MAE     MSE
C_CNN                      26.94   50.69
MCNN                       28.31   56.25
CSRNet                     32.10   52.86
CANet                      26.82   47.86
Res101_SFCN                31.16   54.02
SCAR                       27.50   48.05
DM-Count                   35.28   61.62
The present application    26.03   46.11
Example two
The embodiment provides a multi-category target counting system based on multi-perception feature fusion, which comprises:
a first extraction module: used for acquiring an image containing multiple categories of targets and extracting a feature map of the image;
a second extraction module: used for extracting multi-scale features and spatial features from the feature map, and extracting channel features from the multi-scale features;
a fusion module: used for fusing the spatial features and the channel features to obtain fused features;
a density map construction module: used for inputting the fused features into a counting network and outputting a density map through the counting network;
a quantity prediction module: used for integrating the density map to obtain the number of targets of each category.
Example III
The present embodiment provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of the multi-category target counting method based on multi-perception feature fusion of embodiment one when executing the computer program.
Example IV
The present embodiment provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the multi-category target counting method based on multi-perception feature fusion of embodiment one.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The scheme in the embodiments of the present application can be implemented in various computer languages, for example the object-oriented programming language Java or the scripting language JavaScript.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It is apparent that the above examples are given by way of illustration only and are not limiting of the embodiments. Other variations and modifications of the present application will be apparent to those of ordinary skill in the art in light of the foregoing description. It is not necessary here nor is it exhaustive of all embodiments. And obvious variations or modifications thereof are contemplated as falling within the scope of the present application.

Claims (10)

1. A multi-category target counting method based on multi-perception feature fusion, characterized by comprising the following steps:
step S1: acquiring an image containing multiple categories of targets, and extracting a feature map of the image;
step S2: extracting multi-scale features and spatial features from the feature map, and extracting channel features from the multi-scale features;
step S3: fusing the spatial features and the channel features to obtain fused features;
step S4: inputting the fused features into a counting network, and outputting a density map through the counting network;
step S5: integrating the density map to obtain the number of targets of each category.
2. The multi-class object counting method based on multi-perception feature fusion according to claim 1, wherein: step S1, extracting a feature map of the image through a convolutional neural network;
the neural network comprises a first convolution unit, a second convolution unit, a third convolution unit and a fourth convolution unit which are sequentially connected;
wherein a pooling layer is arranged between the first convolution unit and the second convolution unit, between the second convolution unit and the third convolution unit and between the third convolution unit and the fourth convolution unit;
the first convolution unit comprises two convolution layers with 64 channels, the second convolution unit comprises two convolution layers with 128 channels, the third convolution unit comprises three convolution layers with 256 channels, and the fourth convolution unit comprises three convolution layers with 512 channels.
3. The multi-class object counting method based on multi-perception feature fusion according to claim 1, wherein: the step S2 is to extract multi-scale features of the feature map, and the method comprises the following steps:
inputting the feature map F_0 into a multi-scale feature extraction network, extracting four features of different scales from the feature map F_0 through the multi-scale feature extraction network, and splicing the four features of different scales to obtain a multi-scale feature F;
wherein the multi-scale feature extraction network comprises four parallel convolution units;
the first convolution unit comprises a first expanded convolution layer, and an average pooling layer is arranged in front of the first expanded convolution layer;
the second convolution unit comprises a second expansion convolution layer, a third expansion convolution layer and a fourth expansion convolution layer which are sequentially connected;
the third convolution unit comprises a fifth expansion convolution layer and a sixth expansion convolution layer which are sequentially connected;
the fourth convolution unit includes a seventh expanded convolution layer.
4. The multi-class object counting method based on multi-perception feature fusion according to claim 3, wherein: step S2 extracts the channel features of the multi-scale feature as follows:
extracting initial channel features C(F)_initial of the multi-scale feature through a channel feature extraction network;
multiplying the initial channel features C(F)_initial element-wise with the multi-scale feature to obtain the channel features C(F);
wherein the initial channel features C(F)_initial of the multi-scale feature are extracted by the channel feature extraction network according to the formula:
C(F)_initial = concat(σ(L(r(L(v, W_0), W_1)))(F))
wherein F denotes the multi-scale feature, concat denotes matrix concatenation, σ denotes the sigmoid function, v denotes the feature vector obtained through the average pooling layer, W_0 and W_1 are the parameters of the two linear layers, L is a linear layer and r is the ReLU activation function.
5. The multi-class object counting method based on multi-perception feature fusion according to claim 1, wherein: step S2 extracts the spatial features of the feature map as follows:
extracting initial spatial features S(F_0)_initial of the feature map through a spatial feature extraction network;
multiplying the initial spatial features S(F_0)_initial element-wise with the feature map to obtain the spatial features S(F_0);
wherein the initial spatial features S(F_0)_initial of the feature map are extracted by the spatial feature extraction network according to the formula:
S(F_0)_initial = σ(Conv(MAP(F_0), θ)) ⊙ F_0
wherein F_0 denotes the feature map, σ denotes the sigmoid function, Conv denotes a 7×7 convolution layer, MAP denotes maximum pooling and average pooling, θ denotes the parameters of the spatial feature extraction network, and ⊙ denotes element-wise multiplication.
6. The multi-class object counting method based on multi-perception feature fusion according to claim 1, wherein: in the step S3, the spatial feature and the channel feature are fused to obtain a fusion feature, where the formula is:
Output_MA = C(F) + S(F_0)
wherein C(F) denotes the channel features and S(F_0) denotes the spatial features.
7. The multi-class object counting method based on multi-perception feature fusion according to claim 1, wherein: the counting network in the step S4 comprises five convolution layers which are connected in sequence, wherein a third convolution layer is a deconvolution layer; the fifth convolution layer has a convolution kernel size of 1 x 1 for outputting the density map.
8. A multi-category target counting system based on multi-perception feature fusion, characterized by comprising:
a first extraction module: used for acquiring an image containing multiple categories of targets and extracting a feature map of the image;
a second extraction module: used for extracting multi-scale features and spatial features from the feature map, and extracting channel features from the multi-scale features;
a fusion module: used for fusing the spatial features and the channel features to obtain fused features;
a density map construction module: used for inputting the fused features into a counting network and outputting a density map through the counting network;
a quantity prediction module: used for integrating the density map to obtain the number of targets of each category.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized by: the processor, when executing the computer program, implements the steps of the multi-class object counting method based on multi-perceptual feature fusion as defined in any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon a computer program, characterized by: the computer program, when executed by a processor, implements the steps of the multi-class object counting method based on multi-perceptual feature fusion as defined in any one of claims 1 to 7.
CN202310513969.8A 2023-05-09 2023-05-09 Multi-category target counting method and system based on multi-perception feature fusion Pending CN116645516A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310513969.8A CN116645516A (en) 2023-05-09 2023-05-09 Multi-category target counting method and system based on multi-perception feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310513969.8A CN116645516A (en) 2023-05-09 2023-05-09 Multi-category target counting method and system based on multi-perception feature fusion

Publications (1)

Publication Number Publication Date
CN116645516A true CN116645516A (en) 2023-08-25

Family

ID=87642647

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310513969.8A Pending CN116645516A (en) 2023-05-09 2023-05-09 Multi-category target counting method and system based on multi-perception feature fusion

Country Status (1)

Country Link
CN (1) CN116645516A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200074186A1 (en) * 2018-08-28 2020-03-05 Beihang University Dense crowd counting method and apparatus
CN109948553A (en) * 2019-03-20 2019-06-28 北京航空航天大学 A kind of multiple dimensioned dense population method of counting
CN110188685A (en) * 2019-05-30 2019-08-30 燕山大学 A kind of object count method and system based on the multiple dimensioned cascade network of double attentions
CN112132023A (en) * 2020-09-22 2020-12-25 上海应用技术大学 Crowd counting method based on multi-scale context enhanced network
CN113762009A (en) * 2020-11-18 2021-12-07 四川大学 Crowd counting method based on multi-scale feature fusion and double-attention machine mechanism
CN113011329A (en) * 2021-03-19 2021-06-22 陕西科技大学 Pyramid network based on multi-scale features and dense crowd counting method
US11631238B1 (en) * 2022-04-13 2023-04-18 Iangxi Electric Power Research Institute Of State Grid Method for recognizing distribution network equipment based on raspberry pi multi-scale feature fusion

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
孙俊 等 (Sun Jun et al.): "基于无人机图像的多尺度感知麦穗计数方法" (Multi-scale perception method for wheat ear counting based on UAV images), 农业工程学报 (Transactions of the Chinese Society of Agricultural Engineering), 31 December 2021 (2021-12-31) *
潘诚 (Pan Cheng): "密集场景下的人群计数算法研究" (Research on crowd counting algorithms in dense scenes), 中国优秀硕士学位论文全文数据库 信息科技辑 (China Master's Theses Full-text Database, Information Science and Technology), 15 February 2023 (2023-02-15) *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination