CN117876824B - Multi-modal crowd counting model training method, system, storage medium and equipment - Google Patents

Multi-modal crowd counting model training method, system, storage medium and equipment

Info

Publication number
CN117876824B
Authority
CN
China
Prior art keywords
rgbt
rgb
feature
training
images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410270297.7A
Other languages
Chinese (zh)
Other versions
CN117876824A (en)
Inventor
余鹰 (Yu Ying)
余家茂 (Yu Jiamao)
肖乾 (Xiao Qian)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongyou (Jiangxi) Network Technology Co., Ltd.
Original Assignee
East China Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Jiaotong University filed Critical East China Jiaotong University
Priority to CN202410270297.7A
Publication of CN117876824A
Application granted
Publication of CN117876824B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/0455 - Auto-encoder networks; Encoder-decoder networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 - Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 - Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 - Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/52 - Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53 - Recognition of crowd images, e.g. recognition of crowd congestion

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a multi-modal crowd counting model training method, system, storage medium and device. The method comprises: acquiring crowd scene images and annotating their crowd positions to obtain multiple groups of training images, building a training set from these groups and generating label density maps; extracting feature maps from the crowd scene images and performing multi-modal feature fusion on the feature maps to obtain fused features; performing self-distillation learning on the fused features and computing the self-distillation losses; performing feature aggregation and multi-scale perception of the target area on the fused features and predicting the number of people to obtain a predicted density map; and computing a Bayesian loss from the predicted density map and the label density map of the corresponding training images, computing the overall network loss from it, and back-propagating the loss to update the model parameters, thereby obtaining the multi-modal crowd counting model. The invention can efficiently model global context information and enhance multi-scale information extraction capability.

Description

Multi-modal crowd counting model training method, system, storage medium and equipment
Technical Field
The invention relates to the technical field of computer vision, and in particular to a multi-modal crowd counting model training method, system, storage medium and device.
Background
Crowd counting is a research hotspot in the field of intelligent visual analysis. It aims to estimate the number, density and distribution of people in a scene, and has important application value in fields such as smart city construction.
Most current crowd counting methods are based on a single modality, i.e. they rely on RGB images or video for counting. However, when RGB information is insufficient, for example in night scenes, the counting performance degrades considerably. Recent studies therefore began to explore multi-modal crowd counting, i.e. using thermal sensors to generate thermal (Thermal) images as supplementary information for collaborative counting when RGB information is insufficient. Most current multi-modal crowd counting models are obtained by training: a camera and a thermal sensor collect a large number of RGB and Thermal images containing crowd information, the RGB and Thermal images are paired, each RGB-Thermal (RGB-T) pair is annotated to build a training data set, and the data set is used to train a neural network to obtain a crowd counting model. In practical application, a captured crowd image and thermal image are input into the model, which directly outputs the predicted number of people.
However, current multi-modal crowd counting methods have several limitations. On one hand, most existing methods rely on Convolutional Neural Networks (CNNs) as the backbone network, yet the receptive field of CNNs is limited, which hinders modeling of global context information in complex scenes. On the other hand, multi-modal crowd counting involves information fusion between two modalities, and existing methods usually adopt a simple adaptive addition fusion strategy, which ignores the different importance of the modalities during fusion. Furthermore, crowd counting suffers from severe scale variation, which seriously affects counting accuracy.
Disclosure of Invention
In view of the above, the invention aims to provide a multi-modal crowd counting model training method, system, storage medium and device, so as to solve at least one of the technical problems in the prior art.
The invention provides a multi-modal crowd counting model training method, which is applied to an improved Swin Transformer network. The improved Swin Transformer network comprises an RGB encoder, a Thermal encoder, an RGBT fusion module, a self-distillation learning module and a multi-scale regression module; the RGB encoder and the Thermal encoder each comprise multiple layers of feature extraction units, and the feature extraction units of different layers have different resolutions. The RGBT fusion module fuses RGB information and Thermal information using a channel attention mechanism and a spatial attention mechanism; the self-distillation learning module is used for promoting feature interaction between deep and shallow layers of the network so as to promote multi-modal feature fusion; the multi-scale regression module uses standard convolutions and dilated convolutions of different sizes to enhance the multi-scale information extraction capability of the network. The method comprises the following steps:
Acquiring crowd scene images, each comprising an RGB image and a Thermal image; annotating the crowd positions of the crowd scene images to obtain multiple groups of training images; building a training set for training the crowd counting model from the groups of training images; and generating a label density map corresponding to each group of training images in the training set;
Sending the RGB image and the Thermal image to the RGB encoder and the Thermal encoder respectively for feature extraction, obtaining RGB feature maps F_n^RGB and Thermal feature maps F_n^T of different stages, where n ∈ {1,2,3,4}; sending the RGB feature map F_n^RGB and the Thermal feature map F_n^T obtained at each stage to the RGBT fusion module of the corresponding stage for multi-modal feature fusion, obtaining the fused features F_n^RGBT, which comprise F_1^RGBT, F_2^RGBT, F_3^RGBT and F_4^RGBT;
Feeding F_2^RGBT, F_3^RGBT and F_4^RGBT into the self-distillation learning module for self-distillation learning, and computing the self-distillation losses L_1^a and L_1^b; sending F_2^RGBT, F_3^RGBT and F_4^RGBT to the multi-scale regression module for feature aggregation, multi-scale perception of the target area and people-number prediction, finally obtaining a predicted density map;
Computing a Bayesian loss L_bayes from the predicted density map and the label density map of the corresponding training images, computing the overall network loss function L_density from the Bayesian loss L_bayes, the overall network loss function L_density being a weighted combination of the Bayesian loss and the self-distillation losses, back-propagating the computed loss to update the model parameters, and training with the updated model parameters to obtain the multi-modal crowd counting model.
According to the multi-modal crowd counting model training method described above, an RGBT fusion module, a self-distillation learning module and a multi-scale regression module are newly added to the original Swin Transformer network to construct an improved Swin Transformer network, and the improved Swin Transformer network is used to train the multi-modal crowd counting model, so as to solve the technical problems of multi-modal crowd counting methods in the prior art. Specifically, the RGB encoder and the Thermal encoder use the Swin Transformer as the backbone network; with its sliding-window self-attention and global receptive field, global context information can be modeled efficiently, while the window design enhances information interaction across windows and reduces computation cost, effectively overcoming the shortcomings of CNN-based methods. The RGBT fusion module fuses RGB information and Thermal information with channel attention and spatial attention mechanisms, assigning different attention weights to the different modalities so that their differences are better attended to and multi-modal information fusion is promoted. The self-distillation learning module uses the fused features of the deep layers of the network as soft labels to guide the multi-modal fusion of the shallow layers, effectively promoting feature interaction between deep and shallow layers and thereby further promoting multi-modal feature fusion. The multi-scale regression module adopts standard convolutions and dilated convolutions of different sizes, further enhancing the multi-scale information extraction capability of the network and handling target scale variation more effectively.
In addition, the multi-modal crowd counting model training method provided by the invention can also have the following additional technical characteristics:
Further, the step of sending the RGB image and the Thermal image to the RGB encoder and the Thermal encoder respectively for feature extraction, obtaining RGB feature maps F_n^RGB and Thermal feature maps F_n^T of different stages, comprises:
For a group of training images in the training set, sending its RGB image to the feature extraction units of the RGB encoder for feature extraction, obtaining the feature map extracted by each layer of extraction unit, so as to obtain the RGB feature maps of different stages;
For a group of training images in the training set, sending its Thermal image to the feature extraction units of the Thermal encoder for feature extraction, obtaining the feature map extracted by each layer of extraction unit, so as to obtain the Thermal feature maps of different stages.
Further, in the step of sending the RGB feature map F_n^RGB and the Thermal feature map F_n^T obtained at each stage to the RGBT fusion module of the corresponding stage for multi-modal feature fusion to obtain the fused feature F_n^RGBT, the method by which the RGBT fusion module performs multi-modal feature fusion comprises:
Channel-splicing the input features F_n^RGB and F_n^T, passing the spliced result through a channel attention and a Sigmoid activation function and then slicing it, to obtain the attention maps of F_n^RGB and F_n^T on the channel domain respectively;
Multiplying F_n^RGB and F_n^T by their corresponding attention maps to obtain F_n^RGB' and F_n^T', completing the feature enhancement on the channel domain;
Applying attention enhancement to F_n^RGB' and F_n^T' on the spatial domain and channel-splicing the results, then activating with a Softmax activation function and slicing, to obtain the attention maps of F_n^RGB' and F_n^T' on the spatial domain respectively; multiplying F_n^RGB' and F_n^T' by their corresponding attention maps to obtain F_n^RGB'' and F_n^T'', completing the attention enhancement on the spatial domain; and finally adding the corresponding elements of F_n^RGB'' and F_n^T'' to obtain the fused feature F_n^RGBT.
Further, the step of sending F_2^RGBT, F_3^RGBT and F_4^RGBT to the multi-scale regression module for feature aggregation, multi-scale perception of the target area and people-number prediction comprises:
Upsampling F_3^RGBT by 2× and F_4^RGBT by 4× so that their resolutions match F_2^RGBT, then adjusting the number of feature channels of F_2^RGBT, F_3^RGBT and F_4^RGBT with a 3×3 convolution so that their resolutions and channel numbers are consistent, and adding the features;
Sending the added result into a four-branch structure, where each branch consists of a standard convolution and a dilated convolution of different sizes, the convolution kernel sizes being 1×1, 3×3, 5×5 and 7×7 and the dilation rates of the dilated convolutions being 1, 3, 5 and 7 respectively;
Channel-splicing the results of the four branches, then performing dimensionality reduction and density map regression through a 3×3 convolution and a 1×1 convolution, obtaining the prediction result of the number of people in the image.
Further, the Bayesian loss L_bayes is computed as:
L_bayes = Σ_{n=1}^{N} L_1(c_n − E[c_n])
where N is the number of annotation points, c_n is the actual number of people at each annotation point, E[c_n] is the expected number of people at each annotation point, and L_1 represents the L_1 distance function.
Further, the overall network loss function L_density is computed as:
L_density = L_bayes + λ_1·L_1^a + λ_2·L_1^b, where λ_1 and λ_2 are adjustable hyper-parameters;
where L_1^a is the self-distillation loss between F_2^RGBT and F_4^RGBT, and L_1^b is the self-distillation loss between F_3^RGBT and F_4^RGBT.
In another aspect, the present invention provides a multi-modal crowd counting model training system which executes the multi-modal crowd counting model training method described above, the system comprising:
An acquisition module, used for acquiring crowd scene images, each comprising an RGB image and a Thermal image, annotating the crowd positions of the crowd scene images to obtain multiple groups of training images, building a training set for training the crowd counting model from the groups of training images, and generating a label density map corresponding to each group of training images in the training set;
A feature fusion module, used for sending the RGB image and the Thermal image to the RGB encoder and the Thermal encoder respectively for feature extraction, obtaining RGB feature maps F_n^RGB and Thermal feature maps F_n^T of different stages, where n ∈ {1,2,3,4}, and sending the RGB feature map F_n^RGB and the Thermal feature map F_n^T obtained at each stage to the RGBT fusion module of the corresponding stage for multi-modal feature fusion, obtaining the fused features F_n^RGBT, which comprise F_1^RGBT, F_2^RGBT, F_3^RGBT and F_4^RGBT;
A prediction module, used for feeding F_2^RGBT, F_3^RGBT and F_4^RGBT into the self-distillation learning module for self-distillation learning and computing the self-distillation losses L_1^a and L_1^b, and for sending F_2^RGBT, F_3^RGBT and F_4^RGBT to the multi-scale regression module for feature aggregation, multi-scale perception of the target area and people-number prediction, finally obtaining a predicted density map;
An updating module, used for computing a Bayesian loss L_bayes from the predicted density map and the label density map of the corresponding training images, computing the overall network loss function L_density from the Bayesian loss L_bayes, the overall network loss function L_density being a weighted combination of the Bayesian loss and the self-distillation losses, back-propagating the computed loss to update the model parameters, and training with the updated model parameters to obtain the multi-modal crowd counting model.
Another aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the multi-modal crowd counting model training method described above.
In another aspect, the present invention further provides a data processing device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the multi-modal crowd counting model training method described above when executing the program.
Drawings
FIG. 1 is a schematic diagram of an improved Swin Transformer network in accordance with an embodiment of the present invention;
FIG. 2 is a schematic diagram of an RGBT fusion module according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a self-distillation learning module according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a multi-scale regression module according to an embodiment of the present invention;
FIG. 5 is a flowchart of a training method for a multi-modal crowd count model in a first embodiment of the invention;
The invention will be further described in the following detailed description in conjunction with the above-described figures.
Detailed Description
In order that the invention may be readily understood, a more complete description of the invention will be rendered by reference to the appended drawings. Several embodiments of the invention are presented in the figures. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
In order to solve the technical problems of the multi-modal crowd counting method in the prior art, the application provides a multi-modal crowd counting model training method, a system, a storage medium and equipment.
Specifically, referring to FIGS. 1-4, the multi-modal crowd counting model training method of the present application is applied to an improved Swin Transformer network. The improved Swin Transformer network comprises an RGB encoder, a Thermal encoder, an RGBT fusion module, a self-distillation learning module and a multi-scale regression module; the RGB encoder and the Thermal encoder each comprise multiple layers of feature extraction units, and the feature extraction units of different layers have different resolutions. The RGBT fusion module fuses RGB information and Thermal information using a channel attention mechanism and a spatial attention mechanism; the self-distillation learning module is used for promoting feature interaction between deep and shallow layers of the network so as to promote multi-modal feature fusion; the multi-scale regression module uses standard convolutions and dilated convolutions of different sizes to enhance the multi-scale information extraction capability of the network.
In the present application, the RGB encoder and the Thermal encoder are Swin Transformers of identical structure, each comprising feature extraction units of four stages whose resolutions are 1/4, 1/8, 1/16 and 1/32 of the input, in sequence. Accordingly, the inputs of the four-stage RGBT fusion modules are (F_1^RGB, F_1^T), (F_2^RGB, F_2^T), (F_3^RGB, F_3^T) and (F_4^RGB, F_4^T), respectively.
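As an orientation for the architecture just described, the following is a minimal PyTorch sketch of the overall wiring: two four-stage encoders stand in for the Swin Transformer backbones, the RGB and Thermal features of each stage are fused, and the fused features of stages 2-4 are aggregated into a density map. The class names (StubEncoder, RGBTWiring), the channel dimensions and the 1×1-convolution placeholders used here for the fusion and regression modules are illustrative assumptions, not the patented implementation; the actual fusion and regression modules are detailed in the steps below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StubEncoder(nn.Module):
    """Stand-in for a Swin Transformer encoder: four stages at 1/4, 1/8, 1/16, 1/32 resolution."""
    def __init__(self, in_ch=3, dims=(96, 192, 384, 768)):
        super().__init__()
        chans = (in_ch,) + tuple(dims)
        self.stages = nn.ModuleList([
            nn.Conv2d(chans[i], chans[i + 1], 3, stride=4 if i == 0 else 2, padding=1)
            for i in range(4)])

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # [F1, F2, F3, F4] at strides 4, 8, 16, 32

class RGBTWiring(nn.Module):
    """Overall wiring: per-stage fusion, then aggregation of F2-F4 into a density map."""
    def __init__(self, dims=(96, 192, 384, 768)):
        super().__init__()
        self.enc_rgb, self.enc_t = StubEncoder(dims=dims), StubEncoder(dims=dims)
        # Placeholder for the per-stage RGBT fusion modules (channel + spatial attention).
        self.fuse = nn.ModuleList([nn.Conv2d(2 * d, d, 1) for d in dims])
        # Placeholder for the multi-scale regression module operating on F2, F3, F4.
        self.head = nn.Conv2d(sum(dims[1:]), 1, 1)

    def forward(self, rgb, thermal):
        f_rgb, f_t = self.enc_rgb(rgb), self.enc_t(thermal)
        fused = [f(torch.cat([r, t], 1)) for f, r, t in zip(self.fuse, f_rgb, f_t)]
        f2, f3, f4 = fused[1], fused[2], fused[3]
        size = f2.shape[-2:]
        agg = torch.cat([f2,
                         F.interpolate(f3, size=size, mode='bilinear', align_corners=False),
                         F.interpolate(f4, size=size, mode='bilinear', align_corners=False)], 1)
        return fused, self.head(agg)  # fused features for self-distillation, density map

rgb, thermal = torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256)
fused, density = RGBTWiring()(rgb, thermal)
print([f.shape for f in fused], density.shape)
```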
In order to facilitate an understanding of the invention, several embodiments of the invention will be presented below. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
Example 1
Referring to FIG. 5, a multi-modal crowd counting model training method according to a first embodiment of the invention is shown, comprising steps S101 to S104:
S101: acquiring crowd scene images, each comprising an RGB image and a Thermal image; annotating the crowd positions of the crowd scene images to obtain multiple groups of training images; constructing a training set for training the crowd counting model from the groups of training images; and generating a label density map corresponding to each group of training images in the training set.
In this embodiment, the label density map is generated as:
D_GT(x) = Σ_{i=1}^{M} δ(x − x_i) * G(x)
where D_GT denotes the label density map, x denotes the coordinates of each pixel in the training image, x_i denotes the center coordinate of the i-th person's head in the training image, G denotes the Gaussian kernel function, M denotes the total number of persons contained in the training image, δ(x − x_i) is the impulse function, and * denotes convolution.
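A small sketch of label density map generation under the above formula may look as follows: each annotated head center contributes an impulse, which is smoothed with a Gaussian kernel so that the map integrates to the number of annotated persons. The fixed Gaussian sigma and the function name label_density_map are assumptions for illustration; the patent does not fix the kernel bandwidth.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def label_density_map(head_points, height, width, sigma=4.0):
    """head_points: iterable of (row, col) head-center coordinates in pixel units."""
    impulses = np.zeros((height, width), dtype=np.float32)
    for r, c in head_points:
        r = min(max(int(round(r)), 0), height - 1)
        c = min(max(int(round(c)), 0), width - 1)
        impulses[r, c] += 1.0                      # delta(x - x_i)
    return gaussian_filter(impulses, sigma=sigma)  # convolution with the Gaussian kernel G

d_gt = label_density_map([(120.3, 64.8), (40.0, 200.5)], height=256, width=320)
print(d_gt.sum())  # approximately 2.0, the number of annotated persons
```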
S102: sending the RGB image and the Thermal image to the RGB encoder and the Thermal encoder respectively for feature extraction, obtaining RGB feature maps F_n^RGB and Thermal feature maps F_n^T of different stages, where n ∈ {1,2,3,4}; and sending the RGB feature map F_n^RGB and the Thermal feature map F_n^T obtained at each stage to the RGBT fusion module of the corresponding stage for multi-modal feature fusion, obtaining the fused feature F_n^RGBT.
In this technical scheme, the Thermal image is the thermal-modality image and the Thermal encoder is the thermal-modality encoder. As a specific example, the fused features F_n^RGBT comprise F_1^RGBT, F_2^RGBT, F_3^RGBT and F_4^RGBT. Further, the RGB feature maps F_n^RGB are acquired as follows: for a group of training images in the training set, the RGB image is sent to the feature extraction units of the RGB encoder for feature extraction, and the feature map extracted by each layer of extraction unit is obtained, yielding the RGB feature maps of different stages.
Further, the Thermal feature maps F_n^T are acquired as follows: for a group of training images in the training set, the Thermal image is sent to the feature extraction units of the Thermal encoder for feature extraction, and the feature map extracted by each layer of extraction unit is obtained, yielding the Thermal feature maps of different stages.
In this embodiment, the RGBT fusion module performs multi-modal feature fusion as follows:
The input features F_n^RGB and F_n^T are channel-spliced; the spliced result is passed through a channel attention and a Sigmoid activation function and then sliced, giving the attention maps of F_n^RGB and F_n^T on the channel domain respectively; F_n^RGB and F_n^T are multiplied by their corresponding attention maps to obtain F_n^RGB' and F_n^T', completing the feature enhancement on the channel domain. F_n^RGB' and F_n^T' are then subjected to attention enhancement on the spatial domain and channel-spliced, activated with a Softmax activation function and sliced, giving the attention maps of F_n^RGB' and F_n^T' on the spatial domain respectively; F_n^RGB' and F_n^T' are multiplied by their corresponding attention maps to obtain F_n^RGB'' and F_n^T'', completing the attention enhancement on the spatial domain; finally the corresponding elements of F_n^RGB'' and F_n^T'' are added to obtain the fused feature F_n^RGBT.
In this embodiment, the feature enhancement on the channel domain can be expressed as:
F_n^CA_RGB, F_n^CA_T = Slice(Sigmoid(CA(Concat(F_n^RGB, F_n^T))))
F_n^RGB' = F_n^RGB ⊗ F_n^CA_RGB,  F_n^T' = F_n^T ⊗ F_n^CA_T
where Concat() denotes the channel splicing operation, CA() denotes the channel attention operation, Sigmoid() denotes the Sigmoid activation function, Slice() denotes the slicing operation, ⊗ denotes element-wise multiplication, and F_n^CA_RGB and F_n^CA_T denote the channel-domain attention maps of F_n^RGB and F_n^T respectively.
Further, the fused feature F_n^RGBT is obtained as:
F_n^SA_RGB, F_n^SA_T = Slice(Softmax(Concat(SA(F_n^RGB'), SA(F_n^T'))))
F_n^RGBT = F_n^RGB' ⊗ F_n^SA_RGB ⊕ F_n^T' ⊗ F_n^SA_T
where SA() denotes the spatial attention mechanism, Softmax() denotes the Softmax activation function, ⊕ denotes element-wise addition, and F_n^SA_RGB and F_n^SA_T denote the spatial-domain attention maps of F_n^RGB' and F_n^T' respectively.
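A minimal sketch of the RGBT fusion module following the process above (channel splicing, channel attention with Sigmoid and slicing, then spatial attention with Softmax and slicing, and element-wise addition) might be written as follows. The concrete CA and SA blocks used here (a squeeze-and-excitation style channel attention and a 7×7-convolution spatial attention) are assumptions; the patent fixes only the overall fusion flow.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1))

    def forward(self, x):
        return self.mlp(x)  # B x C x 1 x 1 channel descriptor (pre-activation)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.max(dim=1, keepdim=True).values], dim=1)
        return self.conv(pooled)  # B x 1 x H x W spatial response (pre-activation)

class RGBTFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(2 * channels)
        self.sa_rgb, self.sa_t = SpatialAttention(), SpatialAttention()

    def forward(self, f_rgb, f_t):
        # Channel domain: Concat -> CA -> Sigmoid -> Slice -> reweight each modality.
        ca = torch.sigmoid(self.ca(torch.cat([f_rgb, f_t], dim=1)))
        ca_rgb, ca_t = ca.chunk(2, dim=1)
        f_rgb_c, f_t_c = f_rgb * ca_rgb, f_t * ca_t
        # Spatial domain: per-modality SA -> Concat -> Softmax over the two maps -> Slice.
        sa = torch.softmax(torch.cat([self.sa_rgb(f_rgb_c), self.sa_t(f_t_c)], dim=1), dim=1)
        sa_rgb, sa_t = sa.chunk(2, dim=1)
        # Element-wise addition of the spatially enhanced features gives F_n^RGBT.
        return f_rgb_c * sa_rgb + f_t_c * sa_t

fused = RGBTFusion(96)(torch.randn(2, 96, 64, 64), torch.randn(2, 96, 64, 64))
print(fused.shape)  # torch.Size([2, 96, 64, 64])
```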
The resolutions of F_1^RGBT, F_2^RGBT, F_3^RGBT and F_4^RGBT output by the four stages of the RGBT fusion module are 1/4, 1/8, 1/16 and 1/32, respectively.
The fused features F_2^RGBT, F_3^RGBT and F_4^RGBT, with resolutions 1/8, 1/16 and 1/32, are input into the self-distillation learning module for self-distillation learning. The fused feature F_4^RGBT from the deep layers of the network contains more sufficient global context information and therefore yields higher counting accuracy. Accordingly, the deep fused feature F_4^RGBT serves as a soft label to guide the shallow features F_2^RGBT and F_3^RGBT towards better multi-modal feature fusion. Specifically, the prediction result obtained from F_4^RGBT is used as the soft label, and the losses L_1^a and L_1^b between the predictions from F_2^RGBT and F_3^RGBT and the soft label are computed with the L_1 loss function (i.e., MAE, mean absolute error):
L_1^a = L_1(D_2, D_4),  L_1^b = L_1(D_3, D_4)
where D_2, D_3 and D_4 denote the predictions obtained from F_2^RGBT, F_3^RGBT and F_4^RGBT respectively, and D_4 serves as the soft label.
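The self-distillation step can be sketched as below: auxiliary heads map F_2^RGBT, F_3^RGBT and F_4^RGBT to density predictions at a common resolution, the deep prediction serves as the soft label, and L_1^a and L_1^b are L1 (MAE) losses between the shallow predictions and that soft label. The 1×1-convolution auxiliary heads and the gradient detach on the soft label are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfDistillation(nn.Module):
    def __init__(self, ch2, ch3, ch4):
        super().__init__()
        self.head2 = nn.Conv2d(ch2, 1, 1)
        self.head3 = nn.Conv2d(ch3, 1, 1)
        self.head4 = nn.Conv2d(ch4, 1, 1)

    def forward(self, f2, f3, f4):
        size = f2.shape[-2:]                       # bring all predictions to the 1/8 scale
        d2 = self.head2(f2)
        d3 = F.interpolate(self.head3(f3), size=size, mode='bilinear', align_corners=False)
        d4 = F.interpolate(self.head4(f4), size=size, mode='bilinear', align_corners=False)
        soft_label = d4.detach()                   # deep prediction guides the shallow layers
        l1_a = F.l1_loss(d2, soft_label)           # loss between the F2 branch and the soft label
        l1_b = F.l1_loss(d3, soft_label)           # loss between the F3 branch and the soft label
        return l1_a, l1_b

l1_a, l1_b = SelfDistillation(192, 384, 768)(
    torch.randn(1, 192, 32, 32), torch.randn(1, 384, 16, 16), torch.randn(1, 768, 8, 8))
print(l1_a.item(), l1_b.item())
```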
S103: feeding F_2^RGBT, F_3^RGBT and F_4^RGBT into the self-distillation learning module for self-distillation learning and computing the self-distillation losses L_1^a and L_1^b; and sending F_2^RGBT, F_3^RGBT and F_4^RGBT to the multi-scale regression module for feature aggregation, multi-scale perception of the target area and people-number prediction, finally obtaining a predicted density map.
Specifically, F_3^RGBT and F_4^RGBT are upsampled by 2× and 4× respectively so that their resolutions match F_2^RGBT; the number of feature channels of F_2^RGBT, F_3^RGBT and F_4^RGBT is then adjusted with a 3×3 convolution so that their resolutions and channel numbers are consistent, and the features are added. The added result is sent into a four-branch structure, where each branch consists of a standard convolution and a dilated convolution of different sizes, the convolution kernel sizes being 1×1, 3×3, 5×5 and 7×7 and the dilation rates of the dilated convolutions being 1, 3, 5 and 7 respectively. The results of the four branches are channel-spliced, and dimensionality reduction and density map regression are then performed through a 3×3 convolution and a 1×1 convolution, obtaining the prediction result of the number of people in the image.
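A sketch of the multi-scale regression module as described above follows: F_3^RGBT and F_4^RGBT are upsampled ×2 and ×4 to the resolution of F_2^RGBT, a 3×3 convolution aligns the channel numbers, the features are added, and the sum passes through four parallel branches (kernel sizes 1×1/3×3/5×5/7×7, dilation rates 1/3/5/7) before a 3×3 and a 1×1 convolution regress the density map. Modeling each branch as a standard k×k convolution followed by a 3×3 dilated convolution is an assumption; the patent does not spell out how the standard and dilated convolutions are composed inside a branch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleRegression(nn.Module):
    def __init__(self, ch2, ch3, ch4, mid=128):
        super().__init__()
        # 3x3 convolutions that align the channel counts of F2, F3, F4.
        self.align = nn.ModuleList([nn.Conv2d(c, mid, 3, padding=1) for c in (ch2, ch3, ch4)])
        # Four branches: standard kxk convolution followed by a 3x3 dilated convolution.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(mid, mid, k, padding=k // 2), nn.ReLU(inplace=True),
                nn.Conv2d(mid, mid, 3, padding=r, dilation=r), nn.ReLU(inplace=True))
            for k, r in zip((1, 3, 5, 7), (1, 3, 5, 7))])
        self.reduce = nn.Conv2d(4 * mid, mid, 3, padding=1)   # dimensionality reduction
        self.predict = nn.Conv2d(mid, 1, 1)                   # density map regression

    def forward(self, f2, f3, f4):
        f3 = F.interpolate(f3, scale_factor=2, mode='bilinear', align_corners=False)
        f4 = F.interpolate(f4, scale_factor=4, mode='bilinear', align_corners=False)
        x = sum(conv(f) for conv, f in zip(self.align, (f2, f3, f4)))   # feature addition
        x = torch.cat([branch(x) for branch in self.branches], dim=1)   # four-branch perception
        return self.predict(self.reduce(x))                             # predicted density map

dmap = MultiScaleRegression(192, 384, 768)(
    torch.randn(1, 192, 32, 32), torch.randn(1, 384, 16, 16), torch.randn(1, 768, 8, 8))
print(dmap.shape)  # torch.Size([1, 1, 32, 32]); summing the map gives the predicted count
```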
S104: computing a Bayesian loss L_bayes from the predicted density map and the label density map of the corresponding training images, computing the overall network loss function L_density from the Bayesian loss L_bayes, the overall network loss function L_density being a weighted combination of the Bayesian loss and the self-distillation losses, back-propagating the computed loss to update the model parameters, and training with the updated model parameters to obtain the multi-modal crowd counting model.
Specifically, the Bayesian loss L_bayes is computed as:
L_bayes = Σ_{n=1}^{N} L_1(c_n − E[c_n])
where N is the number of annotation points, c_n is the actual number of people at each annotation point, E[c_n] is the expected number of people at each annotation point, and L_1 represents the L_1 distance function. Further, the overall network loss function L_density is computed as L_density = L_bayes + λ_1·L_1^a + λ_2·L_1^b, where λ_1 and λ_2 are adjustable hyper-parameters, L_1^a is the self-distillation loss between F_2^RGBT and F_4^RGBT, and L_1^b is the self-distillation loss between F_3^RGBT and F_4^RGBT.
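The overall loss and parameter update can be sketched as below. The model interface (returning the density map together with L_1^a and L_1^b), the precomputed posterior matrix used to obtain the expected counts E[c_n], and the default λ values are assumptions for illustration; the Bayesian loss is reduced to the L1 distance between E[c_n] and the ground-truth count of one person per annotation point, as described above.

```python
import torch

def bayesian_loss(pred_density, posteriors):
    """pred_density: 1 x 1 x H x W predicted density map; posteriors: N x (H*W) posterior
    probability of each pixel belonging to each of the N annotation points (precomputed
    from the annotated head positions)."""
    expected = posteriors @ pred_density.flatten()   # E[c_n] for every annotation point
    return torch.abs(1.0 - expected).sum()           # L1 distance to the ground-truth count c_n = 1

def training_step(model, optimizer, rgb, thermal, posteriors, lam1=0.5, lam2=0.5):
    optimizer.zero_grad()
    density, l1_a, l1_b = model(rgb, thermal)        # predicted density map + self-distillation losses
    l_bayes = bayesian_loss(density, posteriors)     # Bayesian counting loss
    l_density = l_bayes + lam1 * l1_a + lam2 * l1_b  # weighted overall loss L_density
    l_density.backward()                             # back-propagate the computed loss
    optimizer.step()                                 # update the model parameters
    return float(l_density.detach())
```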
In summary, in the multi-modal crowd counting model training method of the above embodiment of the invention, an improved Swin Transformer network is constructed by adding an RGBT fusion module, a self-distillation learning module and a multi-scale regression module to the original Swin Transformer network, and the improved Swin Transformer network is used to train the multi-modal crowd counting model, so as to solve the technical problems of multi-modal crowd counting methods in the prior art. Specifically, the RGB encoder and the Thermal encoder use the Swin Transformer as the backbone network; with its sliding-window self-attention and global receptive field, global context information can be modeled efficiently, while the window design enhances information interaction across windows and reduces computation cost, effectively overcoming the shortcomings of CNN-based methods. The RGBT fusion module fuses RGB information and Thermal information with channel attention and spatial attention mechanisms, assigning different attention weights to the different modalities so that their differences are better attended to and multi-modal information fusion is promoted. The self-distillation learning module uses the fused features of the deep layers of the network as soft labels to guide the multi-modal fusion of the shallow layers, effectively promoting feature interaction between deep and shallow layers and thereby further promoting multi-modal feature fusion. The multi-scale regression module adopts standard convolutions and dilated convolutions of different sizes, further enhancing the multi-scale information extraction capability of the network and handling target scale variation more effectively.
Example two
A second embodiment of the present invention provides a multi-modal crowd counting model training system, the system comprising:
An acquisition module, used for acquiring crowd scene images, each comprising an RGB image and a Thermal image, annotating the crowd positions of the crowd scene images to obtain multiple groups of training images, building a training set for training the crowd counting model from the groups of training images, and generating a label density map corresponding to each group of training images in the training set;
A feature fusion module, used for sending the RGB image and the Thermal image to the RGB encoder and the Thermal encoder respectively for feature extraction, obtaining RGB feature maps F_n^RGB and Thermal feature maps F_n^T of different stages, where n ∈ {1,2,3,4}, and sending the RGB feature map F_n^RGB and the Thermal feature map F_n^T obtained at each stage to the RGBT fusion module of the corresponding stage for multi-modal feature fusion, obtaining the fused features F_n^RGBT, which comprise F_1^RGBT, F_2^RGBT, F_3^RGBT and F_4^RGBT;
A prediction module, used for feeding F_2^RGBT, F_3^RGBT and F_4^RGBT into the self-distillation learning module for self-distillation learning and computing the self-distillation losses L_1^a and L_1^b, and for sending F_2^RGBT, F_3^RGBT and F_4^RGBT to the multi-scale regression module for feature aggregation, multi-scale perception of the target area and people-number prediction, finally obtaining a predicted density map;
An updating module, used for computing a Bayesian loss L_bayes from the predicted density map and the label density map of the corresponding training images, computing the overall network loss function L_density from the Bayesian loss L_bayes, the overall network loss function L_density being a weighted combination of the Bayesian loss and the self-distillation losses, back-propagating the computed loss to update the model parameters, and training with the updated model parameters to obtain the multi-modal crowd counting model.
In summary, in the multi-modal crowd counting model training system of the above embodiment of the invention, an improved Swin Transformer network is constructed by adding an RGBT fusion module, a self-distillation learning module and a multi-scale regression module to the original Swin Transformer network, and the improved Swin Transformer network is used to train the multi-modal crowd counting model, so as to solve the technical problems of multi-modal crowd counting methods in the prior art. Specifically, the RGB encoder and the Thermal encoder use the Swin Transformer as the backbone network; with its sliding-window self-attention and global receptive field, global context information can be modeled efficiently, while the window design enhances information interaction across windows and reduces computation cost, effectively overcoming the shortcomings of CNN-based methods. The RGBT fusion module fuses RGB information and Thermal information with channel attention and spatial attention mechanisms, assigning different attention weights to the different modalities so that their differences are better attended to and multi-modal information fusion is promoted. The self-distillation learning module uses the fused features of the deep layers of the network as soft labels to guide the multi-modal fusion of the shallow layers, effectively promoting feature interaction between deep and shallow layers and thereby further promoting multi-modal feature fusion. The multi-scale regression module adopts standard convolutions and dilated convolutions of different sizes, further enhancing the multi-scale information extraction capability of the network and handling target scale variation more effectively.
Furthermore, an embodiment of the present invention proposes a computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, implements the steps of the method in the above-mentioned embodiment.
Furthermore, an embodiment of the present invention also proposes a data processing apparatus including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method in the above embodiment when executing the program.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., a ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium may even be paper or other suitable medium upon which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, Programmable Gate Arrays (PGAs), Field-Programmable Gate Arrays (FPGAs), and the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.

Claims (8)

1. A multi-modal crowd counting model training method, characterized in that the method is applied to an improved Swin Transformer network, the improved Swin Transformer network comprising an RGB encoder, a Thermal encoder, an RGBT fusion module, a self-distillation learning module and a multi-scale regression module, wherein the RGB encoder and the Thermal encoder each comprise multiple layers of feature extraction units, and the feature extraction units of different layers have different resolutions; the RGBT fusion module fuses RGB information and Thermal information using a channel attention mechanism and a spatial attention mechanism; the self-distillation learning module is used for promoting feature interaction between deep and shallow layers of the network so as to promote multi-modal feature fusion; and the multi-scale regression module uses standard convolutions and dilated convolutions of different sizes to enhance the multi-scale information extraction capability of the network;
The method comprises the following steps:
Acquiring crowd scene images, each comprising an RGB image and a Thermal image; annotating the crowd positions of the crowd scene images to obtain multiple groups of training images; building a training set for training the crowd counting model from the groups of training images; and generating a label density map corresponding to each group of training images in the training set;
Sending the RGB image and the Thermal image to the RGB encoder and the Thermal encoder respectively for feature extraction, obtaining RGB feature maps F_n^RGB and Thermal feature maps F_n^T of different stages, where n ∈ {1,2,3,4}; sending the RGB feature map F_n^RGB and the Thermal feature map F_n^T obtained at each stage to the RGBT fusion module of the corresponding stage for multi-modal feature fusion, obtaining the fused features F_n^RGBT, which comprise F_1^RGBT, F_2^RGBT, F_3^RGBT and F_4^RGBT;
Feeding F_2^RGBT, F_3^RGBT and F_4^RGBT into the self-distillation learning module for self-distillation learning, and computing the self-distillation losses L_1^a and L_1^b; sending F_2^RGBT, F_3^RGBT and F_4^RGBT to the multi-scale regression module for feature aggregation, multi-scale perception of the target area and people-number prediction, finally obtaining a predicted density map;
Computing a Bayesian loss L_bayes from the predicted density map and the label density map of the corresponding training images, computing the overall network loss function L_density from the Bayesian loss L_bayes, the overall network loss function L_density being a weighted combination of the Bayesian loss and the self-distillation losses, back-propagating the computed loss to update the model parameters, and training with the updated model parameters to obtain the multi-modal crowd counting model;
In the step of sending the RGB feature map F_n^RGB and the Thermal feature map F_n^T obtained at each stage to the RGBT fusion module of the corresponding stage for multi-modal feature fusion to obtain the fused feature F_n^RGBT, the method by which the RGBT fusion module performs multi-modal feature fusion comprises:
Channel-splicing the input features F_n^RGB and F_n^T, passing the spliced result through a channel attention and a Sigmoid activation function and then slicing it, to obtain the attention maps of F_n^RGB and F_n^T on the channel domain respectively;
Multiplying F_n^RGB and F_n^T by their corresponding attention maps to obtain F_n^RGB' and F_n^T', completing the feature enhancement on the channel domain;
Applying attention enhancement to F_n^RGB' and F_n^T' on the spatial domain and channel-splicing the results, then activating with a Softmax activation function and slicing, to obtain the attention maps of F_n^RGB' and F_n^T' on the spatial domain respectively; multiplying F_n^RGB' and F_n^T' by their corresponding attention maps to obtain F_n^RGB'' and F_n^T'', completing the attention enhancement on the spatial domain; and finally adding the corresponding elements of F_n^RGB'' and F_n^T'' to obtain the fused feature F_n^RGBT.
2. The multi-modal crowd counting model training method according to claim 1, wherein the step of sending the RGB image and the Thermal image to the RGB encoder and the Thermal encoder respectively for feature extraction, obtaining RGB feature maps F_n^RGB and Thermal feature maps F_n^T of different stages, comprises:
for a group of training images in the training set, sending its RGB image to the feature extraction units of the RGB encoder for feature extraction, obtaining the feature map extracted by each layer of extraction unit, so as to obtain the RGB feature maps of different stages;
for a group of training images in the training set, sending its Thermal image to the feature extraction units of the Thermal encoder for feature extraction, obtaining the feature map extracted by each layer of extraction unit, so as to obtain the Thermal feature maps of different stages.
3. The multi-modal crowd counting model training method according to claim 1, wherein the step of sending F_2^RGBT, F_3^RGBT and F_4^RGBT to the multi-scale regression module for feature aggregation, multi-scale perception of the target area and people-number prediction comprises:
upsampling F_3^RGBT by 2× and F_4^RGBT by 4× so that their resolutions match F_2^RGBT, then adjusting the number of feature channels of F_2^RGBT, F_3^RGBT and F_4^RGBT with a 3×3 convolution so that their resolutions and channel numbers are consistent, and adding the features;
sending the added result into a four-branch structure, where each branch consists of a standard convolution and a dilated convolution of different sizes, the convolution kernel sizes being 1×1, 3×3, 5×5 and 7×7 and the dilation rates of the dilated convolutions being 1, 3, 5 and 7 respectively;
channel-splicing the results of the four branches, then performing dimensionality reduction and density map regression through a 3×3 convolution and a 1×1 convolution, obtaining the prediction result of the number of people in the image.
4. The multi-modal crowd counting model training method according to claim 1, wherein the Bayesian loss L_bayes is computed as:
L_bayes = Σ_{n=1}^{N} L_1(c_n − E[c_n])
where N is the number of annotation points, c_n is the actual number of people at each annotation point, E[c_n] is the expected number of people at each annotation point, and L_1 represents the L_1 distance function.
5. The multi-modal crowd counting model training method according to claim 1, wherein the overall network loss function L_density is computed as:
L_density = L_bayes + λ_1·L_1^a + λ_2·L_1^b, where λ_1 and λ_2 are adjustable hyper-parameters;
where L_1^a is the self-distillation loss between F_2^RGBT and F_4^RGBT, and L_1^b is the self-distillation loss between F_3^RGBT and F_4^RGBT.
6. A multi-modal crowd counting model training system, wherein the system performs the multi-modal crowd counting model training method of any one of claims 1-5, the system comprising:
an acquisition module, used for acquiring crowd scene images, each comprising an RGB image and a Thermal image, annotating the crowd positions of the crowd scene images to obtain multiple groups of training images, building a training set for training the crowd counting model from the groups of training images, and generating a label density map corresponding to each group of training images in the training set;
a feature fusion module, used for sending the RGB image and the Thermal image to the RGB encoder and the Thermal encoder respectively for feature extraction, obtaining RGB feature maps F_n^RGB and Thermal feature maps F_n^T of different stages, where n ∈ {1,2,3,4}, and sending the RGB feature map F_n^RGB and the Thermal feature map F_n^T obtained at each stage to the RGBT fusion module of the corresponding stage for multi-modal feature fusion, obtaining the fused features F_n^RGBT, which comprise F_1^RGBT, F_2^RGBT, F_3^RGBT and F_4^RGBT;
a prediction module, used for feeding F_2^RGBT, F_3^RGBT and F_4^RGBT into the self-distillation learning module for self-distillation learning and computing the self-distillation losses L_1^a and L_1^b, and for sending F_2^RGBT, F_3^RGBT and F_4^RGBT to the multi-scale regression module for feature aggregation, multi-scale perception of the target area and people-number prediction, finally obtaining a predicted density map;
an updating module, used for computing a Bayesian loss L_bayes from the predicted density map and the label density map of the corresponding training images, computing the overall network loss function L_density from the Bayesian loss L_bayes, the overall network loss function L_density being a weighted combination of the Bayesian loss and the self-distillation losses, back-propagating the computed loss to update the model parameters, and training with the updated model parameters to obtain the multi-modal crowd counting model.
7. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the multi-modal crowd counting model training method of any one of claims 1-5.
8. A data processing apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the multi-modal crowd counting model training method of any one of claims 1-5 when executing the program.
CN202410270297.7A 2024-03-11 2024-03-11 Multi-modal crowd counting model training method, system, storage medium and equipment Active CN117876824B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410270297.7A CN117876824B (en) 2024-03-11 2024-03-11 Multi-modal crowd counting model training method, system, storage medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410270297.7A CN117876824B (en) 2024-03-11 2024-03-11 Multi-modal crowd counting model training method, system, storage medium and equipment

Publications (2)

Publication Number Publication Date
CN117876824A CN117876824A (en) 2024-04-12
CN117876824B (en) 2024-05-10

Family

ID=90594987

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410270297.7A Active CN117876824B (en) 2024-03-11 2024-03-11 Multi-modal crowd counting model training method, system, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN117876824B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114155210A (en) * 2021-11-17 2022-03-08 大连民族大学 Crowd counting method based on attention mechanism and standardized dense void space multi-scale fusion network
WO2024021394A1 (en) * 2022-07-29 2024-02-01 南京邮电大学 Person re-identification method and apparatus for fusing global features with ladder-shaped local features
CN115423847A (en) * 2022-11-04 2022-12-02 华东交通大学 Twin multi-modal target tracking method based on Transformer
CN115731280A (en) * 2022-11-22 2023-03-03 哈尔滨工程大学 Self-supervision monocular depth estimation method based on Swin-Transformer and CNN parallel network
CN116311083A (en) * 2023-05-19 2023-06-23 华东交通大学 Crowd counting model training method and system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
An interactive network based on transformer for multimodal crowd counting; Yu Ying et al.; Springer; 2023-06-30; full text *
Graph Enhancement and Transformer Aggregation Network for RGB-Thermal Crowd Counting; Yi Pan et al.; IEEE; 2024-02-06; full text *
UTLNet: Uncertainty-Aware Transformer Localization Network for RGB-Depth Mirror Segmentation; Wujie Zhou; IEEE; 2023-10-11; full text *
Research on a crowd density estimation algorithm based on a channel-domain attention mechanism; Ma Qian; Electronic Design Engineering; 2020-08-03 (No. 15); full text *

Also Published As

Publication number Publication date
CN117876824A (en) 2024-04-12

Similar Documents

Publication Publication Date Title
CN111627019A (en) Liver tumor segmentation method and system based on convolutional neural network
CN113012172A (en) AS-UNet-based medical image segmentation method and system
CN111932529B (en) Image classification and segmentation method, device and system
CN109543632A (en) A kind of deep layer network pedestrian detection method based on the guidance of shallow-layer Fusion Features
CN110648331B (en) Detection method for medical image segmentation, medical image segmentation method and device
CN116681679A (en) Medical image small target segmentation method based on double-branch feature fusion attention
CN116309648A (en) Medical image segmentation model construction method based on multi-attention fusion
CN113326735B (en) YOLOv 5-based multi-mode small target detection method
Cheng et al. DDU-Net: A dual dense U-structure network for medical image segmentation
CN115311194A (en) Automatic CT liver image segmentation method based on transformer and SE block
CN117972138B (en) Training method and device for pre-training model and computer equipment
Yang et al. Lane detection with versatile atrousformer and local semantic guidance
CN108009512A (en) A kind of recognition methods again of the personage based on convolutional neural networks feature learning
Zhang et al. Vestibule segmentation from CT images with integration of multiple deep feature fusion strategies
CN117576149A (en) Single-target tracking method based on attention mechanism
CN117876824B (en) Multi-modal crowd counting model training method, system, storage medium and equipment
CN116485791A (en) Automatic detection method and system for double-view breast tumor lesion area based on absorbance
CN115331171A (en) Crowd counting method and system based on depth information and significance information
Zhao et al. Research on human behavior recognition in video based on 3DCCA
CN110276391B (en) Multi-person head orientation estimation method based on deep space-time conditional random field
CN116071825B (en) Action behavior recognition method, system, electronic equipment and storage medium
Lin et al. A meta-fusion RCNN network for endoscopic visual bladder lesions intelligent detection
Sun et al. A Metaverse text recognition model based on character-level contrastive learning
CN117994509B (en) Intelligent fundus image perfusion-free region identification method based on interaction
CN116486203B (en) Single-target tracking method based on twin network and online template updating

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right
Effective date of registration: 20240712

Address after: 2254, 2nd Floor, No. 266 Bowe Road, Traditional Chinese Medicine Science and Technology Innovation City, Ganjiang New District, Nanchang City, Jiangxi Province, 330000

Patentee after: Tongyou (Jiangxi) Network Technology Co.,Ltd.

Country or region after: China

Address before: No. 808, Shuanggang East Road, Nanchang Economic and Technological Development Zone, Jiangxi Province, 330000

Patentee before: East China Jiaotong University

Country or region before: China