CN117876824B - Multi-modal crowd counting model training method, system, storage medium and equipment - Google Patents

Multi-modal crowd counting model training method, system, storage medium and equipment

Info

Publication number
CN117876824B
Authority
CN
China
Prior art keywords
rgbt
rgb
feature
training
images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410270297.7A
Other languages
Chinese (zh)
Other versions
CN117876824A (en)
Inventor
余鹰 (Yu Ying)
余家茂 (Yu Jiamao)
肖乾 (Xiao Qian)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongyou (Jiangxi) Network Technology Co., Ltd.
Original Assignee
East China Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Jiaotong University filed Critical East China Jiaotong University
Priority to CN202410270297.7A
Publication of CN117876824A
Application granted
Publication of CN117876824B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/0455 - Auto-encoder networks; Encoder-decoder networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 - Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 - Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 - Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/52 - Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53 - Recognition of crowd images, e.g. recognition of crowd congestion

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a multi-modal crowd counting model training method, system, storage medium and device. The method comprises: acquiring crowd scene images and annotating their crowd positions to obtain multiple groups of training images, building a training set from these groups and generating label density maps; extracting feature maps from the crowd scene images and performing multi-modal feature fusion on the feature maps to obtain fused features; performing self-distillation learning on the fused features and computing the self-distillation losses; performing feature aggregation and multi-scale perception of the target area on the fused features and predicting the number of people to obtain a predicted density map; and computing a Bayesian loss from the predicted density map and the label density map of the corresponding training images, computing the overall network loss from it, and back-propagating the loss to update the model parameters, thereby obtaining the multi-modal crowd counting model. The invention can efficiently model global context information and enhance multi-scale information extraction capability.

Description

Multi-modal crowd counting model training method, system, storage medium and equipment
Technical Field
The invention relates to the technical field of computer vision, and in particular to a multi-modal crowd counting model training method, system, storage medium and device.
Background
Crowd counting is a research hotspot in the field of intelligent visual analysis. It aims to estimate the number, density and distribution of people in a scene, and has important application value in fields such as smart city construction.
Most current crowd counting methods are based on a single modality, i.e. they rely on RGB images or video for counting. However, when RGB information is insufficient, for example in night scenes, the counting performance degrades considerably. Recent studies therefore began to explore multi-modal crowd counting, i.e. using thermal sensors to generate thermal (Thermal) images as supplementary information for collaborative counting when RGB information is insufficient. Most current multi-modal crowd counting models are obtained by training: a camera and a thermal sensor collect a large number of RGB and Thermal images containing crowd information, the RGB and Thermal images are paired, each RGB-Thermal (RGB-T) pair is annotated to build a training data set, and the data set is used to train a neural network to obtain a crowd counting model. In practical application, a captured crowd image and thermal image are input into the model, which directly outputs the predicted number of people.
However, current multi-modal crowd counting methods have several limitations. On one hand, most existing methods rely on Convolutional Neural Networks (CNNs) as the backbone network, yet the receptive field of CNNs is limited, which hinders modeling of global context information in complex scenes. On the other hand, multi-modal crowd counting involves information fusion between two modalities, and existing methods usually adopt a simple adaptive addition fusion strategy, which ignores the different importance of the modalities during fusion. Furthermore, crowd counting suffers from severe scale variation, which seriously affects counting accuracy.
Disclosure of Invention
In view of the above, the invention aims to provide a multi-modal crowd counting model training method, system, storage medium and device, so as to solve at least one of the technical problems in the prior art.
The invention provides a multi-modal crowd counting model training method, which is applied to an improved Swin Transformer network. The improved Swin Transformer network comprises an RGB encoder, a Thermal encoder, an RGBT fusion module, a self-distillation learning module and a multi-scale regression module; the RGB encoder and the Thermal encoder each comprise multiple layers of feature extraction units, and the feature extraction units of different layers have different resolutions. The RGBT fusion module fuses RGB information and Thermal information using a channel attention mechanism and a spatial attention mechanism; the self-distillation learning module is used for promoting feature interaction between deep and shallow layers of the network so as to promote multi-modal feature fusion; the multi-scale regression module uses standard convolutions and dilated convolutions of different sizes to enhance the multi-scale information extraction capability of the network. The method comprises the following steps:
Acquiring crowd scene images, each comprising an RGB image and a Thermal image; annotating the crowd positions of the crowd scene images to obtain multiple groups of training images; building a training set for training the crowd counting model from the groups of training images; and generating a label density map corresponding to each group of training images in the training set;
Sending the RGB image and the Thermal image to the RGB encoder and the Thermal encoder respectively for feature extraction, obtaining RGB feature maps F_n^RGB and Thermal feature maps F_n^T of different stages, where n ∈ {1,2,3,4}; sending the RGB feature map F_n^RGB and the Thermal feature map F_n^T obtained at each stage to the RGBT fusion module of the corresponding stage for multi-modal feature fusion, obtaining the fused features F_n^RGBT, which comprise F_1^RGBT, F_2^RGBT, F_3^RGBT and F_4^RGBT;
Feeding F_2^RGBT, F_3^RGBT and F_4^RGBT into the self-distillation learning module for self-distillation learning, and computing the self-distillation losses L_1^a and L_1^b; sending F_2^RGBT, F_3^RGBT and F_4^RGBT to the multi-scale regression module for feature aggregation, multi-scale perception of the target area and people-number prediction, finally obtaining a predicted density map;
Computing a Bayesian loss L_bayes from the predicted density map and the label density map of the corresponding training images, computing the overall network loss function L_density from the Bayesian loss L_bayes, the overall network loss function L_density being a weighted combination of the Bayesian loss and the self-distillation losses, back-propagating the computed loss to update the model parameters, and training with the updated model parameters to obtain the multi-modal crowd counting model.
According to the multi-modal crowd counting model training method described above, an RGBT fusion module, a self-distillation learning module and a multi-scale regression module are newly added to the original Swin Transformer network to construct an improved Swin Transformer network, and the improved Swin Transformer network is used to train the multi-modal crowd counting model, so as to solve the technical problems of multi-modal crowd counting methods in the prior art. Specifically, the RGB encoder and the Thermal encoder use the Swin Transformer as the backbone network; with its sliding-window self-attention and global receptive field, global context information can be modeled efficiently, while the window design enhances information interaction across windows and reduces computation cost, effectively overcoming the shortcomings of CNN-based methods. The RGBT fusion module fuses RGB information and Thermal information with channel attention and spatial attention mechanisms, assigning different attention weights to the different modalities so that their differences are better attended to and multi-modal information fusion is promoted. The self-distillation learning module uses the fused features of the deep layers of the network as soft labels to guide the multi-modal fusion of the shallow layers, effectively promoting feature interaction between deep and shallow layers and thereby further promoting multi-modal feature fusion. The multi-scale regression module adopts standard convolutions and dilated convolutions of different sizes, further enhancing the multi-scale information extraction capability of the network and handling target scale variation more effectively.
In addition, the multi-modal crowd counting model training method provided by the invention can also have the following additional technical characteristics:
Further, the step of sending the RGB image and the Thermal image to the RGB encoder and the Thermal encoder respectively for feature extraction, obtaining RGB feature maps F_n^RGB and Thermal feature maps F_n^T of different stages, comprises:
For a group of training images in the training set, sending its RGB image to the feature extraction units of the RGB encoder for feature extraction, obtaining the feature map extracted by each layer of extraction unit, so as to obtain the RGB feature maps of different stages;
For a group of training images in the training set, sending its Thermal image to the feature extraction units of the Thermal encoder for feature extraction, obtaining the feature map extracted by each layer of extraction unit, so as to obtain the Thermal feature maps of different stages.
Further, in the step of sending the RGB feature map F_n^RGB and the Thermal feature map F_n^T obtained at each stage to the RGBT fusion module of the corresponding stage for multi-modal feature fusion to obtain the fused feature F_n^RGBT, the method by which the RGBT fusion module performs multi-modal feature fusion comprises:
Channel-splicing the input features F_n^RGB and F_n^T, passing the spliced result through a channel attention and a Sigmoid activation function and then slicing it, to obtain the attention maps of F_n^RGB and F_n^T on the channel domain respectively;
Multiplying F_n^RGB and F_n^T by their corresponding attention maps to obtain F_n^RGB' and F_n^T', completing the feature enhancement on the channel domain;
Applying attention enhancement to F_n^RGB' and F_n^T' on the spatial domain and channel-splicing the results, then activating with a Softmax activation function and slicing, to obtain the attention maps of F_n^RGB' and F_n^T' on the spatial domain respectively; multiplying F_n^RGB' and F_n^T' by their corresponding attention maps to obtain F_n^RGB'' and F_n^T'', completing the attention enhancement on the spatial domain; and finally adding the corresponding elements of F_n^RGB'' and F_n^T'' to obtain the fused feature F_n^RGBT.
Further, the step of sending F_2^RGBT, F_3^RGBT and F_4^RGBT to the multi-scale regression module for feature aggregation, multi-scale perception of the target area and people-number prediction comprises:
Upsampling F_3^RGBT by 2× and F_4^RGBT by 4× so that their resolutions match F_2^RGBT, then adjusting the number of feature channels of F_2^RGBT, F_3^RGBT and F_4^RGBT with a 3×3 convolution so that their resolutions and channel numbers are consistent, and adding the features;
Sending the added result into a four-branch structure, where each branch consists of a standard convolution and a dilated convolution of different sizes, the convolution kernel sizes being 1×1, 3×3, 5×5 and 7×7 and the dilation rates of the dilated convolutions being 1, 3, 5 and 7 respectively;
Channel-splicing the results of the four branches, then performing dimensionality reduction and density map regression through a 3×3 convolution and a 1×1 convolution, obtaining the prediction result of the number of people in the image.
Further, the Bayesian loss L_bayes is computed as:
L_bayes = Σ_{n=1}^{N} L_1(c_n − E[c_n])
where N is the number of annotation points, c_n is the actual number of people at each annotation point, E[c_n] is the expected number of people at each annotation point, and L_1 represents the L_1 distance function.
Further, the overall network loss function L_density is computed as:
L_density = L_bayes + λ_1·L_1^a + λ_2·L_1^b, where λ_1 and λ_2 are adjustable hyper-parameters;
where L_1^a is the self-distillation loss between F_2^RGBT and F_4^RGBT, and L_1^b is the self-distillation loss between F_3^RGBT and F_4^RGBT.
In another aspect, the present invention provides a multi-modal crowd counting model training system which executes the multi-modal crowd counting model training method described above, the system comprising:
An acquisition module, used for acquiring crowd scene images, each comprising an RGB image and a Thermal image, annotating the crowd positions of the crowd scene images to obtain multiple groups of training images, building a training set for training the crowd counting model from the groups of training images, and generating a label density map corresponding to each group of training images in the training set;
A feature fusion module, used for sending the RGB image and the Thermal image to the RGB encoder and the Thermal encoder respectively for feature extraction, obtaining RGB feature maps F_n^RGB and Thermal feature maps F_n^T of different stages, where n ∈ {1,2,3,4}, and sending the RGB feature map F_n^RGB and the Thermal feature map F_n^T obtained at each stage to the RGBT fusion module of the corresponding stage for multi-modal feature fusion, obtaining the fused features F_n^RGBT, which comprise F_1^RGBT, F_2^RGBT, F_3^RGBT and F_4^RGBT;
A prediction module, used for feeding F_2^RGBT, F_3^RGBT and F_4^RGBT into the self-distillation learning module for self-distillation learning and computing the self-distillation losses L_1^a and L_1^b, and for sending F_2^RGBT, F_3^RGBT and F_4^RGBT to the multi-scale regression module for feature aggregation, multi-scale perception of the target area and people-number prediction, finally obtaining a predicted density map;
An updating module, used for computing a Bayesian loss L_bayes from the predicted density map and the label density map of the corresponding training images, computing the overall network loss function L_density from the Bayesian loss L_bayes, the overall network loss function L_density being a weighted combination of the Bayesian loss and the self-distillation losses, back-propagating the computed loss to update the model parameters, and training with the updated model parameters to obtain the multi-modal crowd counting model.
Another aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the multi-modal crowd counting model training method described above.
In another aspect, the present invention further provides a data processing device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the multi-modal crowd counting model training method described above when executing the program.
Drawings
FIG. 1 is a schematic diagram of an improved Swin Transformer network in accordance with an embodiment of the present invention;
FIG. 2 is a schematic diagram of an RGBT fusion module according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a self-distillation learning module according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a multi-scale regression module according to an embodiment of the present invention;
FIG. 5 is a flowchart of a training method for a multi-modal crowd count model in a first embodiment of the invention;
The invention will be further described in the following detailed description in conjunction with the above-described figures.
Detailed Description
In order that the invention may be readily understood, a more complete description of the invention will be rendered by reference to the appended drawings. Several embodiments of the invention are presented in the figures. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
In order to solve the technical problems of the multi-modal crowd counting method in the prior art, the application provides a multi-modal crowd counting model training method, a system, a storage medium and equipment.
Specifically, referring to FIGS. 1-4, the multi-modal crowd counting model training method of the present application is applied to an improved Swin Transformer network. The improved Swin Transformer network comprises an RGB encoder, a Thermal encoder, an RGBT fusion module, a self-distillation learning module and a multi-scale regression module; the RGB encoder and the Thermal encoder each comprise multiple layers of feature extraction units, and the feature extraction units of different layers have different resolutions. The RGBT fusion module fuses RGB information and Thermal information using a channel attention mechanism and a spatial attention mechanism; the self-distillation learning module is used for promoting feature interaction between deep and shallow layers of the network so as to promote multi-modal feature fusion; the multi-scale regression module uses standard convolutions and dilated convolutions of different sizes to enhance the multi-scale information extraction capability of the network.
In the present application, the RGB encoder and the Thermal encoder are Swin Transformers of identical structure, each comprising feature extraction units of four stages whose resolutions are 1/4, 1/8, 1/16 and 1/32 of the input, in sequence. Accordingly, the inputs of the four-stage RGBT fusion modules are (F_1^RGB, F_1^T), (F_2^RGB, F_2^T), (F_3^RGB, F_3^T) and (F_4^RGB, F_4^T), respectively.
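As an orientation for the architecture just described, the following is a minimal PyTorch sketch of the overall wiring: two four-stage encoders stand in for the Swin Transformer backbones, the RGB and Thermal features of each stage are fused, and the fused features of stages 2-4 are aggregated into a density map. The class names (StubEncoder, RGBTWiring), the channel dimensions and the 1×1-convolution placeholders used here for the fusion and regression modules are illustrative assumptions, not the patented implementation; the actual fusion and regression modules are detailed in the steps below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StubEncoder(nn.Module):
    """Stand-in for a Swin Transformer encoder: four stages at 1/4, 1/8, 1/16, 1/32 resolution."""
    def __init__(self, in_ch=3, dims=(96, 192, 384, 768)):
        super().__init__()
        chans = (in_ch,) + tuple(dims)
        self.stages = nn.ModuleList([
            nn.Conv2d(chans[i], chans[i + 1], 3, stride=4 if i == 0 else 2, padding=1)
            for i in range(4)])

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # [F1, F2, F3, F4] at strides 4, 8, 16, 32

class RGBTWiring(nn.Module):
    """Overall wiring: per-stage fusion, then aggregation of F2-F4 into a density map."""
    def __init__(self, dims=(96, 192, 384, 768)):
        super().__init__()
        self.enc_rgb, self.enc_t = StubEncoder(dims=dims), StubEncoder(dims=dims)
        # Placeholder for the per-stage RGBT fusion modules (channel + spatial attention).
        self.fuse = nn.ModuleList([nn.Conv2d(2 * d, d, 1) for d in dims])
        # Placeholder for the multi-scale regression module operating on F2, F3, F4.
        self.head = nn.Conv2d(sum(dims[1:]), 1, 1)

    def forward(self, rgb, thermal):
        f_rgb, f_t = self.enc_rgb(rgb), self.enc_t(thermal)
        fused = [f(torch.cat([r, t], 1)) for f, r, t in zip(self.fuse, f_rgb, f_t)]
        f2, f3, f4 = fused[1], fused[2], fused[3]
        size = f2.shape[-2:]
        agg = torch.cat([f2,
                         F.interpolate(f3, size=size, mode='bilinear', align_corners=False),
                         F.interpolate(f4, size=size, mode='bilinear', align_corners=False)], 1)
        return fused, self.head(agg)  # fused features for self-distillation, density map

rgb, thermal = torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256)
fused, density = RGBTWiring()(rgb, thermal)
print([f.shape for f in fused], density.shape)
```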
In order to facilitate an understanding of the invention, several embodiments of the invention will be presented below. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
Example 1
Referring to FIG. 5, a multi-modal crowd counting model training method according to a first embodiment of the invention is shown, comprising steps S101 to S104:
S101: acquiring crowd scene images, each comprising an RGB image and a Thermal image; annotating the crowd positions of the crowd scene images to obtain multiple groups of training images; constructing a training set for training the crowd counting model from the groups of training images; and generating a label density map corresponding to each group of training images in the training set.
In this embodiment, the label density map is generated as:
D_GT(x) = Σ_{i=1}^{M} δ(x − x_i) * G(x)
where D_GT denotes the label density map, x denotes the coordinates of each pixel in the training image, x_i denotes the center coordinate of the i-th person's head in the training image, G denotes the Gaussian kernel function, M denotes the total number of persons contained in the training image, δ(x − x_i) is the impulse function, and * denotes convolution.
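A small sketch of label density map generation under the above formula may look as follows: each annotated head center contributes an impulse, which is smoothed with a Gaussian kernel so that the map integrates to the number of annotated persons. The fixed Gaussian sigma and the function name label_density_map are assumptions for illustration; the patent does not fix the kernel bandwidth.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def label_density_map(head_points, height, width, sigma=4.0):
    """head_points: iterable of (row, col) head-center coordinates in pixel units."""
    impulses = np.zeros((height, width), dtype=np.float32)
    for r, c in head_points:
        r = min(max(int(round(r)), 0), height - 1)
        c = min(max(int(round(c)), 0), width - 1)
        impulses[r, c] += 1.0                      # delta(x - x_i)
    return gaussian_filter(impulses, sigma=sigma)  # convolution with the Gaussian kernel G

d_gt = label_density_map([(120.3, 64.8), (40.0, 200.5)], height=256, width=320)
print(d_gt.sum())  # approximately 2.0, the number of annotated persons
```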
S102: sending the RGB image and the Thermal image to the RGB encoder and the Thermal encoder respectively for feature extraction, obtaining RGB feature maps F_n^RGB and Thermal feature maps F_n^T of different stages, where n ∈ {1,2,3,4}; and sending the RGB feature map F_n^RGB and the Thermal feature map F_n^T obtained at each stage to the RGBT fusion module of the corresponding stage for multi-modal feature fusion, obtaining the fused feature F_n^RGBT.
In this technical scheme, the Thermal image is the thermal-modality image and the Thermal encoder is the thermal-modality encoder. As a specific example, the fused features F_n^RGBT comprise F_1^RGBT, F_2^RGBT, F_3^RGBT and F_4^RGBT. Further, the RGB feature maps F_n^RGB are acquired as follows: for a group of training images in the training set, the RGB image is sent to the feature extraction units of the RGB encoder for feature extraction, and the feature map extracted by each layer of extraction unit is obtained, yielding the RGB feature maps of different stages.
Further, the Thermal feature maps F_n^T are acquired as follows: for a group of training images in the training set, the Thermal image is sent to the feature extraction units of the Thermal encoder for feature extraction, and the feature map extracted by each layer of extraction unit is obtained, yielding the Thermal feature maps of different stages.
In this embodiment, the RGBT fusion module performs multi-modal feature fusion as follows:
The input features F_n^RGB and F_n^T are channel-spliced; the spliced result is passed through a channel attention and a Sigmoid activation function and then sliced, giving the attention maps of F_n^RGB and F_n^T on the channel domain respectively; F_n^RGB and F_n^T are multiplied by their corresponding attention maps to obtain F_n^RGB' and F_n^T', completing the feature enhancement on the channel domain. F_n^RGB' and F_n^T' are then subjected to attention enhancement on the spatial domain and channel-spliced, activated with a Softmax activation function and sliced, giving the attention maps of F_n^RGB' and F_n^T' on the spatial domain respectively; F_n^RGB' and F_n^T' are multiplied by their corresponding attention maps to obtain F_n^RGB'' and F_n^T'', completing the attention enhancement on the spatial domain; finally the corresponding elements of F_n^RGB'' and F_n^T'' are added to obtain the fused feature F_n^RGBT.
In this embodiment, the feature enhancement on the channel domain can be expressed as:
F_n^CA_RGB, F_n^CA_T = Slice(Sigmoid(CA(Concat(F_n^RGB, F_n^T))))
F_n^RGB' = F_n^RGB ⊗ F_n^CA_RGB,  F_n^T' = F_n^T ⊗ F_n^CA_T
where Concat() denotes the channel splicing operation, CA() denotes the channel attention operation, Sigmoid() denotes the Sigmoid activation function, Slice() denotes the slicing operation, ⊗ denotes element-wise multiplication, and F_n^CA_RGB and F_n^CA_T denote the channel-domain attention maps of F_n^RGB and F_n^T respectively.
Further, the fused feature F_n^RGBT is obtained as:
F_n^SA_RGB, F_n^SA_T = Slice(Softmax(Concat(SA(F_n^RGB'), SA(F_n^T'))))
F_n^RGBT = F_n^RGB' ⊗ F_n^SA_RGB ⊕ F_n^T' ⊗ F_n^SA_T
where SA() denotes the spatial attention mechanism, Softmax() denotes the Softmax activation function, ⊕ denotes element-wise addition, and F_n^SA_RGB and F_n^SA_T denote the spatial-domain attention maps of F_n^RGB' and F_n^T' respectively.
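A minimal sketch of the RGBT fusion module following the process above (channel splicing, channel attention with Sigmoid and slicing, then spatial attention with Softmax and slicing, and element-wise addition) might be written as follows. The concrete CA and SA blocks used here (a squeeze-and-excitation style channel attention and a 7×7-convolution spatial attention) are assumptions; the patent fixes only the overall fusion flow.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1))

    def forward(self, x):
        return self.mlp(x)  # B x C x 1 x 1 channel descriptor (pre-activation)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.max(dim=1, keepdim=True).values], dim=1)
        return self.conv(pooled)  # B x 1 x H x W spatial response (pre-activation)

class RGBTFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(2 * channels)
        self.sa_rgb, self.sa_t = SpatialAttention(), SpatialAttention()

    def forward(self, f_rgb, f_t):
        # Channel domain: Concat -> CA -> Sigmoid -> Slice -> reweight each modality.
        ca = torch.sigmoid(self.ca(torch.cat([f_rgb, f_t], dim=1)))
        ca_rgb, ca_t = ca.chunk(2, dim=1)
        f_rgb_c, f_t_c = f_rgb * ca_rgb, f_t * ca_t
        # Spatial domain: per-modality SA -> Concat -> Softmax over the two maps -> Slice.
        sa = torch.softmax(torch.cat([self.sa_rgb(f_rgb_c), self.sa_t(f_t_c)], dim=1), dim=1)
        sa_rgb, sa_t = sa.chunk(2, dim=1)
        # Element-wise addition of the spatially enhanced features gives F_n^RGBT.
        return f_rgb_c * sa_rgb + f_t_c * sa_t

fused = RGBTFusion(96)(torch.randn(2, 96, 64, 64), torch.randn(2, 96, 64, 64))
print(fused.shape)  # torch.Size([2, 96, 64, 64])
```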
The resolutions of F_1^RGBT, F_2^RGBT, F_3^RGBT and F_4^RGBT output by the four stages of the RGBT fusion module are 1/4, 1/8, 1/16 and 1/32, respectively.
The fused features F_2^RGBT, F_3^RGBT and F_4^RGBT, with resolutions 1/8, 1/16 and 1/32, are input into the self-distillation learning module for self-distillation learning. The fused feature F_4^RGBT from the deep layers of the network contains more sufficient global context information and therefore yields higher counting accuracy. Accordingly, the deep fused feature F_4^RGBT serves as a soft label to guide the shallow features F_2^RGBT and F_3^RGBT towards better multi-modal feature fusion. Specifically, the prediction result obtained from F_4^RGBT is used as the soft label, and the losses L_1^a and L_1^b between the predictions from F_2^RGBT and F_3^RGBT and the soft label are computed with the L_1 loss function (i.e., MAE, mean absolute error):
L_1^a = L_1(D_2, D_4),  L_1^b = L_1(D_3, D_4)
where D_2, D_3 and D_4 denote the predictions obtained from F_2^RGBT, F_3^RGBT and F_4^RGBT respectively, and D_4 serves as the soft label.
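The self-distillation step can be sketched as below: auxiliary heads map F_2^RGBT, F_3^RGBT and F_4^RGBT to density predictions at a common resolution, the deep prediction serves as the soft label, and L_1^a and L_1^b are L1 (MAE) losses between the shallow predictions and that soft label. The 1×1-convolution auxiliary heads and the gradient detach on the soft label are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfDistillation(nn.Module):
    def __init__(self, ch2, ch3, ch4):
        super().__init__()
        self.head2 = nn.Conv2d(ch2, 1, 1)
        self.head3 = nn.Conv2d(ch3, 1, 1)
        self.head4 = nn.Conv2d(ch4, 1, 1)

    def forward(self, f2, f3, f4):
        size = f2.shape[-2:]                       # bring all predictions to the 1/8 scale
        d2 = self.head2(f2)
        d3 = F.interpolate(self.head3(f3), size=size, mode='bilinear', align_corners=False)
        d4 = F.interpolate(self.head4(f4), size=size, mode='bilinear', align_corners=False)
        soft_label = d4.detach()                   # deep prediction guides the shallow layers
        l1_a = F.l1_loss(d2, soft_label)           # loss between the F2 branch and the soft label
        l1_b = F.l1_loss(d3, soft_label)           # loss between the F3 branch and the soft label
        return l1_a, l1_b

l1_a, l1_b = SelfDistillation(192, 384, 768)(
    torch.randn(1, 192, 32, 32), torch.randn(1, 384, 16, 16), torch.randn(1, 768, 8, 8))
print(l1_a.item(), l1_b.item())
```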
S103: feeding F_2^RGBT, F_3^RGBT and F_4^RGBT into the self-distillation learning module for self-distillation learning and computing the self-distillation losses L_1^a and L_1^b; and sending F_2^RGBT, F_3^RGBT and F_4^RGBT to the multi-scale regression module for feature aggregation, multi-scale perception of the target area and people-number prediction, finally obtaining a predicted density map.
Specifically, F_3^RGBT and F_4^RGBT are upsampled by 2× and 4× respectively so that their resolutions match F_2^RGBT; the number of feature channels of F_2^RGBT, F_3^RGBT and F_4^RGBT is then adjusted with a 3×3 convolution so that their resolutions and channel numbers are consistent, and the features are added. The added result is sent into a four-branch structure, where each branch consists of a standard convolution and a dilated convolution of different sizes, the convolution kernel sizes being 1×1, 3×3, 5×5 and 7×7 and the dilation rates of the dilated convolutions being 1, 3, 5 and 7 respectively. The results of the four branches are channel-spliced, and dimensionality reduction and density map regression are then performed through a 3×3 convolution and a 1×1 convolution, obtaining the prediction result of the number of people in the image.
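A sketch of the multi-scale regression module as described above follows: F_3^RGBT and F_4^RGBT are upsampled ×2 and ×4 to the resolution of F_2^RGBT, a 3×3 convolution aligns the channel numbers, the features are added, and the sum passes through four parallel branches (kernel sizes 1×1/3×3/5×5/7×7, dilation rates 1/3/5/7) before a 3×3 and a 1×1 convolution regress the density map. Modeling each branch as a standard k×k convolution followed by a 3×3 dilated convolution is an assumption; the patent does not spell out how the standard and dilated convolutions are composed inside a branch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleRegression(nn.Module):
    def __init__(self, ch2, ch3, ch4, mid=128):
        super().__init__()
        # 3x3 convolutions that align the channel counts of F2, F3, F4.
        self.align = nn.ModuleList([nn.Conv2d(c, mid, 3, padding=1) for c in (ch2, ch3, ch4)])
        # Four branches: standard kxk convolution followed by a 3x3 dilated convolution.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(mid, mid, k, padding=k // 2), nn.ReLU(inplace=True),
                nn.Conv2d(mid, mid, 3, padding=r, dilation=r), nn.ReLU(inplace=True))
            for k, r in zip((1, 3, 5, 7), (1, 3, 5, 7))])
        self.reduce = nn.Conv2d(4 * mid, mid, 3, padding=1)   # dimensionality reduction
        self.predict = nn.Conv2d(mid, 1, 1)                   # density map regression

    def forward(self, f2, f3, f4):
        f3 = F.interpolate(f3, scale_factor=2, mode='bilinear', align_corners=False)
        f4 = F.interpolate(f4, scale_factor=4, mode='bilinear', align_corners=False)
        x = sum(conv(f) for conv, f in zip(self.align, (f2, f3, f4)))   # feature addition
        x = torch.cat([branch(x) for branch in self.branches], dim=1)   # four-branch perception
        return self.predict(self.reduce(x))                             # predicted density map

dmap = MultiScaleRegression(192, 384, 768)(
    torch.randn(1, 192, 32, 32), torch.randn(1, 384, 16, 16), torch.randn(1, 768, 8, 8))
print(dmap.shape)  # torch.Size([1, 1, 32, 32]); summing the map gives the predicted count
```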
S104: computing a Bayesian loss L_bayes from the predicted density map and the label density map of the corresponding training images, computing the overall network loss function L_density from the Bayesian loss L_bayes, the overall network loss function L_density being a weighted combination of the Bayesian loss and the self-distillation losses, back-propagating the computed loss to update the model parameters, and training with the updated model parameters to obtain the multi-modal crowd counting model.
Specifically, the Bayesian loss L_bayes is computed as:
L_bayes = Σ_{n=1}^{N} L_1(c_n − E[c_n])
where N is the number of annotation points, c_n is the actual number of people at each annotation point, E[c_n] is the expected number of people at each annotation point, and L_1 represents the L_1 distance function. Further, the overall network loss function L_density is computed as L_density = L_bayes + λ_1·L_1^a + λ_2·L_1^b, where λ_1 and λ_2 are adjustable hyper-parameters, L_1^a is the self-distillation loss between F_2^RGBT and F_4^RGBT, and L_1^b is the self-distillation loss between F_3^RGBT and F_4^RGBT.
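The overall loss and parameter update can be sketched as below. The model interface (returning the density map together with L_1^a and L_1^b), the precomputed posterior matrix used to obtain the expected counts E[c_n], and the default λ values are assumptions for illustration; the Bayesian loss is reduced to the L1 distance between E[c_n] and the ground-truth count of one person per annotation point, as described above.

```python
import torch

def bayesian_loss(pred_density, posteriors):
    """pred_density: 1 x 1 x H x W predicted density map; posteriors: N x (H*W) posterior
    probability of each pixel belonging to each of the N annotation points (precomputed
    from the annotated head positions)."""
    expected = posteriors @ pred_density.flatten()   # E[c_n] for every annotation point
    return torch.abs(1.0 - expected).sum()           # L1 distance to the ground-truth count c_n = 1

def training_step(model, optimizer, rgb, thermal, posteriors, lam1=0.5, lam2=0.5):
    optimizer.zero_grad()
    density, l1_a, l1_b = model(rgb, thermal)        # predicted density map + self-distillation losses
    l_bayes = bayesian_loss(density, posteriors)     # Bayesian counting loss
    l_density = l_bayes + lam1 * l1_a + lam2 * l1_b  # weighted overall loss L_density
    l_density.backward()                             # back-propagate the computed loss
    optimizer.step()                                 # update the model parameters
    return float(l_density.detach())
```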
In summary, in the multi-modal crowd counting model training method of the above embodiment of the invention, an improved Swin Transformer network is constructed by adding an RGBT fusion module, a self-distillation learning module and a multi-scale regression module to the original Swin Transformer network, and the improved Swin Transformer network is used to train the multi-modal crowd counting model, so as to solve the technical problems of multi-modal crowd counting methods in the prior art. Specifically, the RGB encoder and the Thermal encoder use the Swin Transformer as the backbone network; with its sliding-window self-attention and global receptive field, global context information can be modeled efficiently, while the window design enhances information interaction across windows and reduces computation cost, effectively overcoming the shortcomings of CNN-based methods. The RGBT fusion module fuses RGB information and Thermal information with channel attention and spatial attention mechanisms, assigning different attention weights to the different modalities so that their differences are better attended to and multi-modal information fusion is promoted. The self-distillation learning module uses the fused features of the deep layers of the network as soft labels to guide the multi-modal fusion of the shallow layers, effectively promoting feature interaction between deep and shallow layers and thereby further promoting multi-modal feature fusion. The multi-scale regression module adopts standard convolutions and dilated convolutions of different sizes, further enhancing the multi-scale information extraction capability of the network and handling target scale variation more effectively.
Example two
A second embodiment of the present invention provides a multi-modal crowd counting model training system, the system comprising:
An acquisition module, used for acquiring crowd scene images, each comprising an RGB image and a Thermal image, annotating the crowd positions of the crowd scene images to obtain multiple groups of training images, building a training set for training the crowd counting model from the groups of training images, and generating a label density map corresponding to each group of training images in the training set;
A feature fusion module, used for sending the RGB image and the Thermal image to the RGB encoder and the Thermal encoder respectively for feature extraction, obtaining RGB feature maps F_n^RGB and Thermal feature maps F_n^T of different stages, where n ∈ {1,2,3,4}, and sending the RGB feature map F_n^RGB and the Thermal feature map F_n^T obtained at each stage to the RGBT fusion module of the corresponding stage for multi-modal feature fusion, obtaining the fused features F_n^RGBT, which comprise F_1^RGBT, F_2^RGBT, F_3^RGBT and F_4^RGBT;
A prediction module, used for feeding F_2^RGBT, F_3^RGBT and F_4^RGBT into the self-distillation learning module for self-distillation learning and computing the self-distillation losses L_1^a and L_1^b, and for sending F_2^RGBT, F_3^RGBT and F_4^RGBT to the multi-scale regression module for feature aggregation, multi-scale perception of the target area and people-number prediction, finally obtaining a predicted density map;
An updating module, used for computing a Bayesian loss L_bayes from the predicted density map and the label density map of the corresponding training images, computing the overall network loss function L_density from the Bayesian loss L_bayes, the overall network loss function L_density being a weighted combination of the Bayesian loss and the self-distillation losses, back-propagating the computed loss to update the model parameters, and training with the updated model parameters to obtain the multi-modal crowd counting model.
In summary, in the multi-modal crowd counting model training system of the above embodiment of the invention, an improved Swin Transformer network is constructed by adding an RGBT fusion module, a self-distillation learning module and a multi-scale regression module to the original Swin Transformer network, and the improved Swin Transformer network is used to train the multi-modal crowd counting model, so as to solve the technical problems of multi-modal crowd counting methods in the prior art. Specifically, the RGB encoder and the Thermal encoder use the Swin Transformer as the backbone network; with its sliding-window self-attention and global receptive field, global context information can be modeled efficiently, while the window design enhances information interaction across windows and reduces computation cost, effectively overcoming the shortcomings of CNN-based methods. The RGBT fusion module fuses RGB information and Thermal information with channel attention and spatial attention mechanisms, assigning different attention weights to the different modalities so that their differences are better attended to and multi-modal information fusion is promoted. The self-distillation learning module uses the fused features of the deep layers of the network as soft labels to guide the multi-modal fusion of the shallow layers, effectively promoting feature interaction between deep and shallow layers and thereby further promoting multi-modal feature fusion. The multi-scale regression module adopts standard convolutions and dilated convolutions of different sizes, further enhancing the multi-scale information extraction capability of the network and handling target scale variation more effectively.
Furthermore, an embodiment of the present invention proposes a computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, implements the steps of the method in the above-mentioned embodiment.
Furthermore, an embodiment of the present invention also proposes a data processing apparatus including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method in the above embodiment when executing the program.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., a ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium may even be paper or other suitable medium upon which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, Programmable Gate Arrays (PGAs), Field-Programmable Gate Arrays (FPGAs), and the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.

Claims (8)

1. A multi-modal crowd counting model training method, characterized in that the method is applied to an improved Swin Transformer network, the improved Swin Transformer network comprising an RGB encoder, a Thermal encoder, an RGBT fusion module, a self-distillation learning module and a multi-scale regression module, wherein the RGB encoder and the Thermal encoder each comprise multiple layers of feature extraction units, and the feature extraction units of different layers have different resolutions; the RGBT fusion module fuses RGB information and Thermal information using a channel attention mechanism and a spatial attention mechanism; the self-distillation learning module is used for promoting feature interaction between deep and shallow layers of the network so as to promote multi-modal feature fusion; and the multi-scale regression module uses standard convolutions and dilated convolutions of different sizes to enhance the multi-scale information extraction capability of the network;
The method comprises the following steps:
Acquiring crowd scene images, each comprising an RGB image and a Thermal image; annotating the crowd positions of the crowd scene images to obtain multiple groups of training images; building a training set for training the crowd counting model from the groups of training images; and generating a label density map corresponding to each group of training images in the training set;
Sending the RGB image and the Thermal image to the RGB encoder and the Thermal encoder respectively for feature extraction, obtaining RGB feature maps F_n^RGB and Thermal feature maps F_n^T of different stages, where n ∈ {1,2,3,4}; sending the RGB feature map F_n^RGB and the Thermal feature map F_n^T obtained at each stage to the RGBT fusion module of the corresponding stage for multi-modal feature fusion, obtaining the fused features F_n^RGBT, which comprise F_1^RGBT, F_2^RGBT, F_3^RGBT and F_4^RGBT;
Feeding F_2^RGBT, F_3^RGBT and F_4^RGBT into the self-distillation learning module for self-distillation learning, and computing the self-distillation losses L_1^a and L_1^b; sending F_2^RGBT, F_3^RGBT and F_4^RGBT to the multi-scale regression module for feature aggregation, multi-scale perception of the target area and people-number prediction, finally obtaining a predicted density map;
Computing a Bayesian loss L_bayes from the predicted density map and the label density map of the corresponding training images, computing the overall network loss function L_density from the Bayesian loss L_bayes, the overall network loss function L_density being a weighted combination of the Bayesian loss and the self-distillation losses, back-propagating the computed loss to update the model parameters, and training with the updated model parameters to obtain the multi-modal crowd counting model;
In the step of sending the RGB feature map F_n^RGB and the Thermal feature map F_n^T obtained at each stage to the RGBT fusion module of the corresponding stage for multi-modal feature fusion to obtain the fused feature F_n^RGBT, the method by which the RGBT fusion module performs multi-modal feature fusion comprises:
Channel-splicing the input features F_n^RGB and F_n^T, passing the spliced result through a channel attention and a Sigmoid activation function and then slicing it, to obtain the attention maps of F_n^RGB and F_n^T on the channel domain respectively;
Multiplying F_n^RGB and F_n^T by their corresponding attention maps to obtain F_n^RGB' and F_n^T', completing the feature enhancement on the channel domain;
Applying attention enhancement to F_n^RGB' and F_n^T' on the spatial domain and channel-splicing the results, then activating with a Softmax activation function and slicing, to obtain the attention maps of F_n^RGB' and F_n^T' on the spatial domain respectively; multiplying F_n^RGB' and F_n^T' by their corresponding attention maps to obtain F_n^RGB'' and F_n^T'', completing the attention enhancement on the spatial domain; and finally adding the corresponding elements of F_n^RGB'' and F_n^T'' to obtain the fused feature F_n^RGBT.
2. The multi-modal crowd counting model training method according to claim 1, wherein the step of sending the RGB image and the Thermal image to the RGB encoder and the Thermal encoder respectively for feature extraction, obtaining RGB feature maps F_n^RGB and Thermal feature maps F_n^T of different stages, comprises:
for a group of training images in the training set, sending its RGB image to the feature extraction units of the RGB encoder for feature extraction, obtaining the feature map extracted by each layer of extraction unit, so as to obtain the RGB feature maps of different stages;
for a group of training images in the training set, sending its Thermal image to the feature extraction units of the Thermal encoder for feature extraction, obtaining the feature map extracted by each layer of extraction unit, so as to obtain the Thermal feature maps of different stages.
3. The multi-modal crowd counting model training method according to claim 1, wherein the step of sending F_2^RGBT, F_3^RGBT and F_4^RGBT to the multi-scale regression module for feature aggregation, multi-scale perception of the target area and people-number prediction comprises:
upsampling F_3^RGBT by 2× and F_4^RGBT by 4× so that their resolutions match F_2^RGBT, then adjusting the number of feature channels of F_2^RGBT, F_3^RGBT and F_4^RGBT with a 3×3 convolution so that their resolutions and channel numbers are consistent, and adding the features;
sending the added result into a four-branch structure, where each branch consists of a standard convolution and a dilated convolution of different sizes, the convolution kernel sizes being 1×1, 3×3, 5×5 and 7×7 and the dilation rates of the dilated convolutions being 1, 3, 5 and 7 respectively;
channel-splicing the results of the four branches, then performing dimensionality reduction and density map regression through a 3×3 convolution and a 1×1 convolution, obtaining the prediction result of the number of people in the image.
4. The multi-modal crowd counting model training method according to claim 1, wherein the Bayesian loss L_bayes is computed as:
L_bayes = Σ_{n=1}^{N} L_1(c_n − E[c_n])
where N is the number of annotation points, c_n is the actual number of people at each annotation point, E[c_n] is the expected number of people at each annotation point, and L_1 represents the L_1 distance function.
5. The multi-modal crowd counting model training method according to claim 1, wherein the overall network loss function L_density is computed as:
L_density = L_bayes + λ_1·L_1^a + λ_2·L_1^b, where λ_1 and λ_2 are adjustable hyper-parameters;
where L_1^a is the self-distillation loss between F_2^RGBT and F_4^RGBT, and L_1^b is the self-distillation loss between F_3^RGBT and F_4^RGBT.
6. A multi-modal crowd counting model training system, wherein the system performs the multi-modal crowd counting model training method of any one of claims 1-5, the system comprising:
an acquisition module, used for acquiring crowd scene images, each comprising an RGB image and a Thermal image, annotating the crowd positions of the crowd scene images to obtain multiple groups of training images, building a training set for training the crowd counting model from the groups of training images, and generating a label density map corresponding to each group of training images in the training set;
a feature fusion module, used for sending the RGB image and the Thermal image to the RGB encoder and the Thermal encoder respectively for feature extraction, obtaining RGB feature maps F_n^RGB and Thermal feature maps F_n^T of different stages, where n ∈ {1,2,3,4}, and sending the RGB feature map F_n^RGB and the Thermal feature map F_n^T obtained at each stage to the RGBT fusion module of the corresponding stage for multi-modal feature fusion, obtaining the fused features F_n^RGBT, which comprise F_1^RGBT, F_2^RGBT, F_3^RGBT and F_4^RGBT;
a prediction module, used for feeding F_2^RGBT, F_3^RGBT and F_4^RGBT into the self-distillation learning module for self-distillation learning and computing the self-distillation losses L_1^a and L_1^b, and for sending F_2^RGBT, F_3^RGBT and F_4^RGBT to the multi-scale regression module for feature aggregation, multi-scale perception of the target area and people-number prediction, finally obtaining a predicted density map;
an updating module, used for computing a Bayesian loss L_bayes from the predicted density map and the label density map of the corresponding training images, computing the overall network loss function L_density from the Bayesian loss L_bayes, the overall network loss function L_density being a weighted combination of the Bayesian loss and the self-distillation losses, back-propagating the computed loss to update the model parameters, and training with the updated model parameters to obtain the multi-modal crowd counting model.
7. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the multi-modal crowd counting model training method of any one of claims 1-5.
8. A data processing apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the multi-modal crowd counting model training method of any one of claims 1-5 when executing the program.
CN202410270297.7A 2024-03-11 2024-03-11 Multi-modal crowd counting model training method, system, storage medium and equipment Active CN117876824B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410270297.7A CN117876824B (en) 2024-03-11 2024-03-11 Multi-modal crowd counting model training method, system, storage medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410270297.7A CN117876824B (en) 2024-03-11 2024-03-11 Multi-modal crowd counting model training method, system, storage medium and equipment

Publications (2)

Publication Number Publication Date
CN117876824A CN117876824A (en) 2024-04-12
CN117876824B (en) 2024-05-10

Family

ID=90594987

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410270297.7A Active CN117876824B (en) 2024-03-11 2024-03-11 Multi-modal crowd counting model training method, system, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN117876824B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114155210A (en) * 2021-11-17 2022-03-08 大连民族大学 Crowd counting method based on attention mechanism and standardized dense void space multi-scale fusion network
WO2024021394A1 (en) * 2022-07-29 2024-02-01 南京邮电大学 Person re-identification method and apparatus for fusing global features with ladder-shaped local features
CN115423847A (en) * 2022-11-04 2022-12-02 华东交通大学 Twin multi-modal target tracking method based on Transformer
CN115731280A (en) * 2022-11-22 2023-03-03 哈尔滨工程大学 Self-supervision monocular depth estimation method based on Swin-Transformer and CNN parallel network
CN116311083A (en) * 2023-05-19 2023-06-23 华东交通大学 Crowd counting model training method and system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
An interactive network based on transformer for multimodal crowd counting; Yu Ying et al.; Springer; 2023-06-30; full text *
Graph Enhancement and Transformer Aggregation Network for RGB-Thermal Crowd Counting; Yi Pan et al.; IEEE; 2024-02-06; full text *
UTLNet: Uncertainty-Aware Transformer Localization Network for RGB-Depth Mirror Segmentation; Wujie Zhou; IEEE; 2023-10-11; full text *
Research on a crowd density estimation algorithm based on a channel-domain attention mechanism; Ma Qian; Electronic Design Engineering; 2020-08-03 (No. 15); full text *

Also Published As

Publication number Publication date
CN117876824A (en) 2024-04-12

Similar Documents

Publication Publication Date Title
CN111627019A (en) Liver tumor segmentation method and system based on convolutional neural network
CN113012172A (en) AS-UNet-based medical image segmentation method and system
CN111932529B (en) Image classification and segmentation method, device and system
CN109543632A (en) A kind of deep layer network pedestrian detection method based on the guidance of shallow-layer Fusion Features
CN110648331B (en) Detection method for medical image segmentation, medical image segmentation method and device
CN116681679A (en) Medical image small target segmentation method based on double-branch feature fusion attention
CN116309648A (en) Medical image segmentation model construction method based on multi-attention fusion
CN113326735B (en) YOLOv 5-based multi-mode small target detection method
Cheng et al. DDU-Net: A dual dense U-structure network for medical image segmentation
CN115311194A (en) Automatic CT liver image segmentation method based on transformer and SE block
CN117972138B (en) Training method and device for pre-training model and computer equipment
Yang et al. Lane detection with versatile atrousformer and local semantic guidance
CN108009512A (en) A kind of recognition methods again of the personage based on convolutional neural networks feature learning
Zhang et al. Vestibule segmentation from CT images with integration of multiple deep feature fusion strategies
CN117576149A (en) Single-target tracking method based on attention mechanism
CN117876824B (en) Multi-modal crowd counting model training method, system, storage medium and equipment
CN116485791A (en) Automatic detection method and system for double-view breast tumor lesion area based on absorbance
CN115331171A (en) Crowd counting method and system based on depth information and significance information
Zhao et al. Research on human behavior recognition in video based on 3DCCA
CN110276391B (en) Multi-person head orientation estimation method based on deep space-time conditional random field
CN116071825B (en) Action behavior recognition method, system, electronic equipment and storage medium
Lin et al. A meta-fusion RCNN network for endoscopic visual bladder lesions intelligent detection
Sun et al. A Metaverse text recognition model based on character-level contrastive learning
CN117994509B (en) Intelligent fundus image perfusion-free region identification method based on interaction
CN116486203B (en) Single-target tracking method based on twin network and online template updating

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right
Effective date of registration: 20240712

Address after: 2254, 2nd Floor, No. 266 Bowe Road, Traditional Chinese Medicine Science and Technology Innovation City, Ganjiang New District, Nanchang City, Jiangxi Province, 330000

Patentee after: Tongyou (Jiangxi) Network Technology Co.,Ltd.

Country or region after: China

Address before: No. 808, Shuanggang East Road, Nanchang Economic and Technological Development Zone, Jiangxi Province, 330000

Patentee before: East China Jiaotong University

Country or region before: China