CN113095246A - Cross-domain self-adaptive people counting method based on transfer learning and scene perception - Google Patents

Cross-domain self-adaptive people counting method based on transfer learning and scene perception

Info

Publication number
CN113095246A
CN113095246A
Authority
CN
China
Prior art keywords
image
domain
scene
density
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110418583.XA
Other languages
Chinese (zh)
Inventor
姜那
温兴森
许鹏
施智平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Capital Normal University
Original Assignee
Capital Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Capital Normal University filed Critical Capital Normal University
Priority to CN202110418583.XA priority Critical patent/CN113095246A/en
Publication of CN113095246A publication Critical patent/CN113095246A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53Recognition of crowd images, e.g. recognition of crowd congestion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a cross-domain adaptive people counting method and system based on transfer learning and scene perception, wherein the method comprises the following steps: S1: training a data enhancement network based on style-transfer learning with sample pairs composed of a source-domain image and a target-domain image, and applying four constraint metrics to the output images to obtain source-domain and target-domain fake images; S2: inputting the real source-domain images and the fake target-domain images into a scene perception classifier and a multi-branch density estimator to predict crowd density; S3: adaptively adjusting the data enhancement network and the multi-branch density estimator according to a conversion-rate index. The method builds low-level data associations through data enhancement and a high-level knowledge bridge through joint training; the probabilities with which the scene classifier predicts an image to belong to different scenes provide perception weights, and highly adaptive crowd density estimation is achieved by weighted fusion, ensuring the generalization and adaptability of the model.

Description

Cross-domain self-adaptive people counting method based on transfer learning and scene perception
Technical Field
The invention relates to the field of computer vision, in particular to a cross-domain self-adaptive people counting method and system based on transfer learning and scene perception.
Background
People counting is a fundamental research problem in computer vision and a key task in intelligent analysis of surveillance video, with particular importance for applications such as traffic control, anomaly early warning, and flow analysis. It focuses on automatically estimating crowd density in real surveillance videos or images using intelligent algorithms. As a problem of clear research value, it has attracted wide attention from industry and academia, and many innovative methods have been proposed, greatly promoting its rapid development. For example, Wang et al. first introduced an end-to-end people counting framework built on convolutional structures, and Zhang et al. designed a multi-column convolutional structure. These methods firmly established the effectiveness of deep learning for people counting. On this basis, Liu et al. proposed DecideNet, which contains detection and regression branches; Li et al. proposed CSRNet, which focuses on high-density scenes; and Sindagi et al. proposed a network fusing top-down and bottom-up features. Through multi-task, multi-scale, and similar designs, people counting accuracy on closed, finite datasets has been further improved.
However, because training sets have limited coverage, real-world data is diverse, and labeling is difficult, current people counting techniques and platforms face insufficient generalization when applied to open, real surveillance scenes. This means that in an open real environment, the algorithmic model must be continuously improved to adapt to cross-domain data and multiple scene types.
Indeed, domain adaptation has been explored in many other vision tasks, such as segmentation, detection, and object re-identification. For example, Sankaranarayanan et al. employed Generative Adversarial Networks (GANs) to mitigate the data distribution gap between source and target domains. Inoue et al. proposed a weakly supervised training method to achieve domain adaptation. Deng et al. and Zhong et al. introduced transfer learning. The principle underlying all these methods is to build a knowledge-transfer bridge between the source and target domains through transfer learning, thereby fundamentally improving model generalization across tasks. These innovations and contributions are inseparable from the rapid development of GANs. In 2019, Wang et al. and Gao et al. successively proposed SeCycleGAN and DACC for domain-adaptive people counting; by jointly training on synthetic enhanced samples and original data, they transferred density estimation knowledge obtained in an actively labeled domain and achieved notable gains in generalization. However, these works ignore a key problem: synthetic images differ from real surveillance images in illumination, scene, and other respects, which easily traps the scene-sensitive people counting task in local optima. Worse, the gap between synthetic and real data is sometimes larger than the gap between real data from different cameras, and the storage and generation costs of synthetic data are extremely high.
Therefore, how to take real closed-set data as the source domain and effectively establish data associations between it and the actual open target domain, so as to realize knowledge transfer that benefits people counting, has become an urgent problem.
Disclosure of Invention
In order to solve the technical problems, the invention provides a cross-domain adaptive people counting method and system based on transfer learning and scene perception.
The technical solution of the invention is as follows: a cross-domain adaptive people counting method based on transfer learning and scene perception comprises the following steps:
step S1: using sample pairs composed of a source-domain image and a target-domain image to train a data enhancement network based on style-transfer learning, and applying four constraint metrics (style similarity loss L_SSC, content similarity loss L_CTC, cycle reconstruction consistency loss L_CYC, and statistical similarity loss L_CSC) to the output images to obtain source-domain and target-domain fake images;
step S2: inputting the source-domain and target-domain fake images into a scene perception classifier for classification to obtain the corresponding scene perception weights; then fusing the resulting density feature maps with the scene perception weights through a multi-branch density estimator to predict the crowd density;
step S3: adaptively adjusting the data enhancement network based on style-transfer learning and the multi-branch density estimator according to a conversion-rate index.
Compared with the prior art, the invention has the following advantages:
1. The proposed method introduces style-transfer learning into data enhancement and designs four constraints (style similarity L_SSC, content similarity L_CTC, cycle reconstruction consistency L_CYC, and statistical similarity L_CSC) that strengthen the data association between the source and target domains. Unlike conventional methods that dynamically estimate a density map for each generated image, it reuses the density map of the source-domain image; this provides a medium for knowledge transfer and at the same time effectively reduces the error accumulated by dynamically estimating density maps for generated samples.
2. The proposed method introduces the concept of scene-aware estimation into crowd density estimation: perception weights are obtained by classifying sample scenes, which avoids the negative influence on parameter learning caused by the alternation of samples from different scenes during training and lets every sample play its full role. Meanwhile, the multi-branch design uses different structures: after shared convolutions separate foreground from background information, each branch strengthens its own attention pattern for counting people in a particular scene type by enlarging the receptive field or exploiting context, and weighted fusion makes the branches' performance complementary. This is especially effective for images whose crowd density changes markedly along the vertical direction due to a pronounced perspective relationship.
3. When crowd density estimation runs in a real scene, the current network must be updated continuously to keep the model optimally adaptive and generalizable. The conversion rate in the proposed method dynamically evaluates whether the current model's data enhancement and scene perception have saturated; if there is still room for improvement, the model is adaptively fine-tuned. The result is an intelligent people counting model best suited to the current surveillance scene, which rapidly predicts and analyzes the number of people in surveillance videos/images, provides necessary early warnings to staff, and effectively prevents safety hazards such as stampedes, illegal gatherings, and road congestion.
Drawings
FIG. 1 is a flowchart of a cross-domain adaptive people counting method based on transfer learning and scene perception according to an embodiment of the present invention;
fig. 2 is a flowchart of step S1 of the cross-domain adaptive people counting method based on transfer learning and scene perception in an embodiment of the present invention: using sample pairs composed of a source-domain image and a target-domain image to train a data enhancement network based on style-transfer learning, and applying the four constraint metrics (style similarity loss L_SSC, content similarity loss L_CTC, cycle reconstruction consistency loss L_CYC, and statistical similarity loss L_CSC) to the output images to obtain source-domain and target-domain fake images;
fig. 3 is a flowchart of step S12 of the method in an embodiment of the present invention: training the data enhancement network based on style-transfer learning on the training sample pair <s_i, t_j>, outputting the generated images g_{Si→Tj} and g_{Tj→Si}, and applying the four constraint metrics (L_SSC, L_CTC, L_CYC, and L_CSC) to obtain the trained data enhancement network based on style-transfer learning;
FIG. 4 is a schematic structural diagram of the data enhancement network based on style-transfer learning according to an embodiment of the present invention;
fig. 5 is a flowchart of step S2 of the method in an embodiment of the present invention: inputting the source-domain and target-domain fake images into a scene perception classifier for classification to obtain the corresponding scene perception weights, and then fusing the resulting density feature maps with the scene perception weights through a multi-branch density estimator to predict the crowd density;
FIG. 6 is a schematic structural diagram of a training scene classifier and a multi-branch density estimator according to an embodiment of the present invention;
fig. 7 is a block diagram illustrating a structure of a cross-domain adaptive people counting system based on transfer learning and scene awareness in an embodiment of the present invention.
Detailed Description
The invention provides a cross-domain adaptive people counting method based on transfer learning and scene perception. It effectively establishes data associations between a closed source domain and the actual open target domain, uses the probabilities with which the scene classifier predicts an image to belong to different scenes as perception weights, and achieves highly adaptive crowd density estimation by weighted fusion.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings.
Example one
As shown in fig. 1, an embodiment of the present invention provides a cross-domain adaptive people counting method based on transfer learning and scene perception, including the following steps:
step S1: using sample pairs composed of a source-domain image and a target-domain image to train a data enhancement network based on style-transfer learning, and applying four constraint metrics (style similarity loss L_SSC, content similarity loss L_CTC, cycle reconstruction consistency loss L_CYC, and statistical similarity loss L_CSC) to the output images to obtain source-domain and target-domain fake images;
in the step, a false image is generated by learning the source domain content and the target domain style, and in the process, the style similarity, the content similarity, the cyclic reconstruction consistency and the statistical similarity of people number are utilized to restrict and reduce a possible mapping space and keep an effective knowledge migration bridge.
Step S2: inputting the source-domain and target-domain fake images into the scene perception classifier for classification to obtain the corresponding scene perception weights; then fusing the resulting density feature maps with the scene perception weights through the multi-branch density estimator to predict the crowd density;
the scene perceptron classifier in the step predicts the probability that the input sample belongs to different scene classifications and takes the probability as a perception weight, and then the perception weight and the density characteristic diagram extracted by the corresponding branch are fused through the multi-branch density estimator to predict the crowd density, thereby realizing the complementation of the people counting results in different observation modes and ensuring that the method provided by the invention can adapt to the real monitoring environment with rich scene types.
Step S3: adaptively adjusting the data enhancement network based on style-transfer learning and the multi-branch density estimator according to the conversion-rate index.
As shown in fig. 2, in one embodiment, step S1, using sample pairs composed of a source-domain image and a target-domain image to train a data enhancement network based on style-transfer learning and applying the four constraint metrics (style similarity loss L_SSC, content similarity loss L_CTC, cycle reconstruction consistency loss L_CYC, and statistical similarity loss L_CSC) to the output images to obtain source-domain and target-domain fake images, specifically includes:
step S11: using the source-domain images S = {s_1, s_2, …, s_n} with labeled density maps and the unlabeled target-domain images T = {t_1, t_2, …, t_n} to form training sample pairs <s_i, t_j>, where s_i is the i-th source-domain image and t_j is the j-th target-domain image;
In this step, the source-domain data with density labels serve as the original images S = {s_1, s_2, …, s_n}, and an equal number of target-domain images T = {t_1, t_2, …, t_n} are randomly selected as target images. Considering GPU computing performance and work efficiency, this embodiment of the invention resizes all data to 512×512 and forms the source-to-target input pairs <s_i, t_j>.
Step S12: for training sample pairs<si,tj>Training the data enhancement network based on style migration learning, and outputting a generated image gSi-TjAnd gTj-SiAnd subjecting it to style similarity lossLSSCContent similarity loss LCTCCyclic reestablishment of consistency loss LCYCAnd statistical similarity loss LCSCAnd obtaining a trained data enhancement network based on the style migration learning by four constraint metrics.
This embodiment of the invention improves on the GAN structure to construct the data enhancement network based on style-transfer learning. The network consists of two generators and two discriminators. The generators share the same structure but not parameters; each is composed of two convolutions with stride 2, ten residual modules, and two deconvolutions with stride 1/2 (i.e., 2× upsampling). To include more crowd and background detail in the generated fake images, a self-attention model is also introduced into the residual modules. The discriminators use a binary classification loss to judge whether an image is real or fake and to update their parameters, and they introduce spectral normalization for extracting high-level semantic features, which is very effective for dense scenes or scenes with a pronounced perspective relationship. Meanwhile, the invention designs four constraints (style similarity loss L_SSC, content similarity loss L_CTC, cycle reconstruction consistency loss L_CYC, and statistical similarity loss L_CSC) to constrain the generated images.
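A minimal PyTorch sketch of this generator and discriminator layout is given below. It is an illustration rather than the patent's exact network: the channel widths, the placement of self-attention in only the last residual module, and the PatchGAN-style discriminator head are assumptions, and the "deconvolutions with stride 1/2" are realized as stride-2 transposed convolutions.

import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    # Lightweight SAGAN-style spatial self-attention used inside residual modules.
    def __init__(self, ch):
        super().__init__()
        self.q = nn.Conv2d(ch, ch // 8, 1)
        self.k = nn.Conv2d(ch, ch // 8, 1)
        self.v = nn.Conv2d(ch, ch, 1)
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)     # (B, HW, C//8)
        k = self.k(x).flatten(2)                     # (B, C//8, HW)
        attn = torch.softmax(q @ k, dim=-1)          # (B, HW, HW)
        v = self.v(x).flatten(2)                     # (B, C, HW)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return x + self.gamma * out

class ResBlock(nn.Module):
    def __init__(self, ch, use_attention=False):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch))
        self.attn = SelfAttention(ch) if use_attention else nn.Identity()

    def forward(self, x):
        return self.attn(x + self.body(x))

class Generator(nn.Module):
    # Two stride-2 convolutions, ten residual modules, two 2x upsampling layers.
    def __init__(self, n_blocks=10):
        super().__init__()
        layers = [nn.Conv2d(3, 64, 7, padding=3), nn.ReLU(True),
                  nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(True),
                  nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(True)]
        layers += [ResBlock(256, use_attention=(i == n_blocks - 1))
                   for i in range(n_blocks)]
        layers += [nn.ConvTranspose2d(256, 128, 3, stride=2, padding=1, output_padding=1),
                   nn.ReLU(True),
                   nn.ConvTranspose2d(128, 64, 3, stride=2, padding=1, output_padding=1),
                   nn.ReLU(True),
                   nn.Conv2d(64, 3, 7, padding=3), nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

def make_discriminator():
    # Binary real/fake critic with spectral normalization on every convolution.
    sn = nn.utils.spectral_norm
    return nn.Sequential(
        sn(nn.Conv2d(3, 64, 4, 2, 1)), nn.LeakyReLU(0.2, True),
        sn(nn.Conv2d(64, 128, 4, 2, 1)), nn.LeakyReLU(0.2, True),
        sn(nn.Conv2d(128, 1, 4, 1, 1)))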
As shown in fig. 3, in one embodiment, step S12, training the data enhancement network based on style-transfer learning on the training sample pairs <s_i, t_j>, outputting the generated images g_{Si→Tj} and g_{Tj→Si}, and applying the four constraint metrics (L_SSC, L_CTC, L_CYC, and L_CSC) to obtain the trained data enhancement network, specifically includes:
step S121: inputting the training sample pair <s_i, t_j> into the data enhancement network based on style-transfer learning and outputting the generated images g_{Si→Tj} and g_{Tj→Si}; computing the bidirectional style similarity loss L_SSC between each generated image and the corresponding style image via formulas (1) to (3), constraining the image g_{Si→Tj} to have the visual style of the target-domain image t_j and, at the same time, the image g_{Tj→Si} to have the visual style of the source-domain image s_i;
L_SSC^{S→T} = (1/(w·h)) · ‖Gram(g_{Si→Tj}) − Gram(t_j)‖²   (1)
L_SSC^{T→S} = (1/(w·h)) · ‖Gram(g_{Tj→Si}) − Gram(s_i)‖²   (2)
L_SSC = L_SSC^{S→T} + L_SSC^{T→S}   (3)
where Gram(·) denotes the Gram matrix, used to extract image scene-style information; w and h denote the width and height of the image; L_SSC^{S→T} denotes the loss from the source domain to the target domain and L_SSC^{T→S} the loss from the target domain to the source domain. The two are computed in the same way but in opposite directions, and together they form the bidirectional loss L_SSC.
In this step, by computing the style difference, expressed through Gram matrices, between the style target image and the generated image, the style similarity loss L_SSC supervises the reciprocal source-to-target and target-to-source training, so that the target style is transferred onto the generated image.
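For concreteness, the Gram-matrix style loss of formulas (1) to (3) can be sketched as follows; applying it directly to image tensors is a simplification (style losses are often computed on deep feature maps instead).

import torch

def gram(x):
    # Gram matrix of a (B, C, H, W) tensor, normalised by spatial size w*h.
    b, c, h, w = x.shape
    f = x.flatten(2)                          # (B, C, H*W)
    return f @ f.transpose(1, 2) / (h * w)

def l_ssc(g_s2t, t_j, g_t2s, s_i):
    # Bidirectional style loss: source-to-target plus target-to-source, formula (3).
    s2t = torch.mean((gram(g_s2t) - gram(t_j)) ** 2)
    t2s = torch.mean((gram(g_t2s) - gram(s_i)) ** 2)
    return s2t + t2s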
Step S122: content similarity loss L between generated image and source domain image by bi-directional computationCTCConstraining the generated image gSi-TjWith source-domain image siThe visual content of (a); at the same time, the image g is constrained to be generatedTj-SiWith a target image tjThe visual content of (a);
In this step, so that a generated image carries the image content of its source-domain counterpart and the subsequent multi-branch density estimator can therefore be supervised with the same labeled density map, the content similarity loss L_CTC is computed bidirectionally between the generated images and their content sources. This preserves the crowd distribution content of the source-domain image, constraining g_{Si→Tj} to have the visual content of s_i and g_{Tj→Si} to have the visual content of t_j. The formula for computing L_CTC is identical to the MSE loss function.
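Since the text states that L_CTC coincides with the MSE loss, a sketch needs only one term per direction; pairing each generated image with its content source follows the constraint described above.

import torch.nn.functional as F

def l_ctc(g_s2t, s_i, g_t2s, t_j):
    # Generated images keep the visual content of their content sources.
    return F.mse_loss(g_s2t, s_i) + F.mse_loss(g_t2s, t_j)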
Step S123: the cycle reconstruction consistency loss L is calculated using the following calculation equation (4)CYCBeam source field image siGenerated image gSi-TjCan be circularly changed back to the source domain image si(ii) a At the same time, the target image t is constrainedjGenerated image gTj-SiCyclically changing back to the target image tj
L_CYC = ‖G_T(G_S(s_i)) − s_i‖_1 + ‖G_S(G_T(t_j)) − t_j‖_1   (4)
where G_S and G_T denote the generators of the cycle-consistency transfer network, taken here as the source-to-target and target-to-source mappings respectively; L_CYC consists of two reciprocal parts, the first term constraining the cycle reconstruction consistency of the source-domain image and the second that of the target-domain image.
Because learning based on the adversarial-game idea happens only in a high-dimensional feature space, many mutual mappings remain possible. To reduce this redundant mapping space and bias the conversion toward the people counting task, supervision by the cycle reconstruction consistency loss L_CYC is required; it is realized by computing the bidirectional L1 distance via formula (4). After a source-domain image is translated, it is reconstructed back to the source domain through a network with the same structure but different parameters, and a target-domain image is likewise reconstructed back to the target domain. This shrinks the space of possible knowledge transfers in the high-dimensional feature space and approaches the direction most beneficial to people counting.
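A sketch of formula (4) follows; the convention that G_S maps source-style images toward the target style and G_T the reverse is an assumption, since the patent only names the two generators.

import torch.nn.functional as F

def l_cyc(G_S, G_T, s_i, t_j):
    # s -> fake target -> reconstructed s, and t -> fake source -> reconstructed t,
    # each penalised with an L1 distance.
    return F.l1_loss(G_T(G_S(s_i)), s_i) + F.l1_loss(G_S(G_T(t_j)), t_j)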
Step S124: the statistical similarity loss L is calculated using the following equations (5) to (7)CSCSo that the target domain generation image can use the annotation information of the same content in the source domain image;
L_CSC^{S→T} = ‖P(g_{Si→Tj}) − P(s_i)‖_1   (5)
L_CSC^{T→S} = ‖P(g_{Tj→Si}) − P(t_j)‖_1   (6)
L_CSC = L_CSC^{S→T} + L_CSC^{T→S}   (7)
where P () represents the density profile of a dense scene passing through the crowd density predictor.
The data enhancement network based on style-transfer learning therefore has not only style and content constraints but also ensures that a data-enhanced image can use the annotation information of its source-domain content image. On top of the three constraints above, a multi-branch density estimator is introduced into the network to evaluate the people statistics of the generated and source-domain images, upgrading the three-channel image-space measurement to the single-channel density-map space. This guarantees that a generated image has target-domain style and source-domain content while remaining usable with the annotations of its content image.
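A sketch of the statistical similarity constraint, with P standing for the crowd density predictor that maps an image to its single-channel density map; pairing each generated image with its content source and using an L1 comparison are assumptions.

import torch.nn.functional as F

def l_csc(P, g_s2t, s_i, g_t2s, t_j):
    # Compare predicted density maps so a generated image keeps the people
    # statistics of its content source (formulas (5) to (7)).
    return F.l1_loss(P(g_s2t), P(s_i)) + F.l1_loss(P(g_t2s), P(t_j))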
Step S125: utilizing the following formula (8) to balance and control the influence of the four constraints on style and content extraction training to obtain false images of a source domain and a target domain;
L* = α_1·L_CYC + α_2·L_CSC + α_3·L_CTC + α_4·L_SSC   (8)
where {α_1, α_2, α_3, α_4} are hyperparameters.
To fully exploit the supervision effects of the different loss functions, this embodiment trains with all of them jointly, designs the loss weights empirically, and computes the total loss as a weighted sum. The weights for the style similarity loss L_SSC, content similarity loss L_CTC, cycle reconstruction consistency loss L_CYC, and statistical similarity loss L_CSC are 2, 0.5, 0.1, and 0.01 respectively; their magnitudes reflect the influence of the different losses on the generated image.
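With these weights, the joint objective of formula (8) reduces to a weighted sum:

def total_loss(ssc, ctc, cyc, csc):
    # alpha_1..alpha_4 from the text: cycle 0.1, statistics 0.01, content 0.5, style 2.
    return 0.1 * cyc + 0.01 * csc + 0.5 * ctc + 2.0 * ssc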
FIG. 4 is a schematic diagram of the data enhancement network based on style-transfer learning. The four designed constraints (style similarity L_SSC, content similarity L_CTC, cycle reconstruction consistency L_CYC, and statistical similarity L_CSC) strengthen the data association between the source and target domains. Unlike conventional methods that dynamically estimate a density map for each generated image, the network reuses the density map of the source-domain image, providing a medium for knowledge transfer while effectively reducing the error accumulated by dynamically estimated sample density maps.
As shown in fig. 5, in one embodiment, step S2, inputting the source-domain and target-domain fake images into the scene perception classifier for classification to obtain the corresponding scene perception weights and then fusing the resulting density feature maps with the scene perception weights through the multi-branch density estimator to predict the crowd density, specifically includes:
step S21: inputting the source-domain and target-domain fake images into the scene perception classifier for classification, obtaining a scene class and the corresponding scene perception weights P = {p_1, p_2, p_3}; the scene classes are: dense scenes, sparse scenes, and medium-density scenes;
to train the scene classifier, 3,000 images were manually labeled, each type of data containing 1,000 images. The annotation is based on the number of people in the scene, the scene with more than 100 people is divided into dense scenes, the scenes with less than 30 people are summarized into sparse scenes, and the image between the dense scenes and the sparse scenes is the medium-density scene. The scene classifier can be trained by utilizing the labeled data, and is mainly used for learning the perception weight P ═ P1,p2,p3Each of the weights reflects the impact of a different sample on updating the network branches, respectively. The scene with the probability of less than 30 persons is regarded as a sparse scene, the scene with the probability of more than 100 persons is regarded as a dense scene, the scenes with the number of other persons are regarded as a medium-density scene, and the specific training processes of the three scenes need to be independently completed.
Step S22: inputting the classified source domain and target domain false images into a multi-branch density estimator, and selecting corresponding branches for estimation according to the classification of the source domain and target domain false images to obtain corresponding density characteristic graphs; wherein the multi-branch density estimator comprises:
the first branch is a density estimator of the dense scene to obtain a density characteristic diagram of the dense scene;
the second branch is a density estimator of the sparse scene to obtain a density characteristic diagram of the sparse scene;
and the third branch is a density estimator of the medium-density scene, and a density characteristic diagram of the medium-density scene is obtained.
The classified source-domain and target-domain fake images then enter the multi-branch density estimator. Considering that counting people in scenes of different densities calls for different patterns of attention, the invention designs three branches with different structures.
The first branch is the density estimator for dense scenes and produces the dense-scene density feature map; this embodiment mainly uses convolutions with a dilation rate of 2.
The second branch is the density estimator for sparse scenes and produces the sparse-scene density feature map; this embodiment uses convolutions with a dilation rate of 4. Both branches use dilated convolutions to enlarge the receptive field.
The third branch is the density estimator for medium-density scenes and produces the medium-density-scene density feature map; this embodiment introduces a self-attention module to learn the influence of contexts at different distances on crowd density estimation.
The multi-branch density estimator of this embodiment consists of a convolution block with shared parameters and three branches with unshared parameters; the network structure configuration of each branch is shown in Table 1:
table 1: three-branch network structure configuration of multi-branch density estimator
[Table 1 appears as an image in the original publication.]
In Table 1, K denotes the convolution kernel size, S the stride, C the number of channels, D the dilation rate, and SA the self-attention module. The shared convolution block in the branches is responsible for separating foreground from background information; the dense-scene branch uses low-dilation convolutions and the sparse-scene branch high-dilation convolutions, effectively enlarging the receptive field, while the medium-density branch introduces a self-attention module to learn the influence of contexts at different distances on crowd density estimation. Because their structures differ, the three branches attend to different statistics, and for surveillance scenes with perspective change the three attention patterns can complement one another, as sketched below.
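Because Table 1 survives only as an image, the following sketch uses assumed channel widths and depths; it preserves the structural facts stated in the text: a shared convolution block, a dilation-2 dense branch, a dilation-4 sparse branch, and a self-attention medium-density branch, each ending in a one-channel density map.

import torch
import torch.nn as nn

class TinySelfAttention(nn.Module):
    # Minimal spatial self-attention for the medium-density branch
    # (in practice applied on downsampled feature maps to bound memory).
    def __init__(self, ch):
        super().__init__()
        self.q = nn.Conv2d(ch, ch, 1)
        self.k = nn.Conv2d(ch, ch, 1)
        self.v = nn.Conv2d(ch, ch, 1)
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)           # (B, HW, C)
        k = self.k(x).flatten(2)                           # (B, C, HW)
        attn = torch.softmax(q @ k / c ** 0.5, dim=-1)     # (B, HW, HW)
        v = self.v(x).flatten(2)                           # (B, C, HW)
        return x + self.gamma * (v @ attn.transpose(1, 2)).view(b, c, h, w)

def dilated_branch(dilation):
    return nn.Sequential(
        nn.Conv2d(64, 64, 3, padding=dilation, dilation=dilation), nn.ReLU(True),
        nn.Conv2d(64, 64, 3, padding=dilation, dilation=dilation), nn.ReLU(True),
        nn.Conv2d(64, 1, 1))                               # one-channel density map

class MultiBranchDensityEstimator(nn.Module):
    def __init__(self):
        super().__init__()
        self.shared = nn.Sequential(                       # separates fore/background
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(True))
        self.dense = dilated_branch(2)                     # dense scenes, dilation 2
        self.sparse = dilated_branch(4)                    # sparse scenes, dilation 4
        self.medium = nn.Sequential(TinySelfAttention(64), nn.Conv2d(64, 1, 1))

    def forward(self, x):
        f = self.shared(x)
        return [self.dense(f), self.sparse(f), self.medium(f)]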
Step S23: fusing the density characteristic diagram with the corresponding scene perception weight, and realizing the prediction of the crowd density by using the following formula (9);
I_Final = Σ_c p_c ⊙ E_c   (9)
where ⊙ denotes the element-wise product, p_c the probability that a sample belongs to scene class c, E_c the feature map predicted by the c-th branch, and I_Final the final fused estimated crowd density map.
To achieve the attention-pattern complementation mentioned above, the invention fuses the scene perception weights P = {p_1, p_2, p_3}, i.e., the scene classification probabilities, with the density feature maps obtained in step S22 via formula (9). In this way every training sample gives positive feedback to knowledge updating, fundamentally avoiding the influence of scene changes on the training of any single branch.
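Formula (9) then reduces to a probability-weighted sum of the branch outputs; the sketch below assumes the weights are the softmax probabilities produced by the scene classifier.

def fuse_density(weights, density_maps):
    # weights: (B, 3) scene probabilities; density_maps: list of three (B, 1, H, W)
    # maps. I_Final = sum_c p_c * E_c, broadcasting each probability over its map.
    return sum(weights[:, c].view(-1, 1, 1, 1) * m
               for c, m in enumerate(density_maps))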
As shown in the schematic structural diagram of the training scene classifier and multi-branch density estimator in fig. 6, the proposed method introduces scene-aware estimation into crowd density estimation: perception weights obtained by classifying sample scenes avoid the negative influence on parameter learning caused by the alternation of samples from different scenes during training and let every sample play its full role. Meanwhile, the branches use different structures: after shared convolutions separate foreground from background information, each branch strengthens its attention pattern for a particular scene type by enlarging the receptive field or exploiting context, and weighted fusion makes their performance complementary. This is especially effective for images whose crowd density changes markedly along the vertical direction due to a pronounced perspective relationship.
In one embodiment, step S3 adaptively adjusts the data enhancement network based on style-transfer learning and the multi-branch density estimator according to the conversion-rate index C_rate. The formula for C_rate is given as an image in the original; it is computed from nd, the MAE of the model without the data enhancement of step S1, st, the MAE when the model is used directly in the cross-domain real scene, and Q, the current MAE of the model under analysis.
When C_rate does not reach the preset threshold, target-domain images can be added progressively in proportion to form new sample pairs, the data enhancement network based on style-transfer learning is further trained, and the learning rates of the different branches in the multi-branch density estimator are adjusted once the data enhancement network plateaus.
This embodiment of the invention provides the conversion-rate index C_rate for automatically evaluating the current status of data enhancement and scene perception. When C_rate does not reach the preset threshold, target-domain images are added progressively in proportion to form new sample pairs, the data enhancement network based on style-transfer learning is further trained, and once it plateaus the learning rates of the different branches in the multi-branch density estimator are adjusted, strengthening scene classification accuracy and optimizing training.
During model training, when crowd density is estimated on target-domain images, the current network must be updated in real time to keep the model optimally adaptive and generalizable. The proposed conversion rate dynamically evaluates whether the current model's data enhancement and scene perception have saturated; if there is room for improvement, the model is adaptively fine-tuned. This yields an intelligent people counting model best suited to the current surveillance scene, which rapidly predicts and analyzes the number of people in surveillance videos/images, provides necessary early warnings to staff, and effectively prevents safety hazards such as stampedes, illegal gatherings, and road congestion.
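Since the conversion-rate formula itself appears only as an image, the check below is a hypothetical reading, taking C_rate as the fraction of the gap between direct cross-domain use and the enhancement-free baseline that the current model has recovered; only the roles of nd, st, and Q come from the text.

def needs_fine_tuning(nd, st, q, threshold=0.9):
    # nd: MAE without the data enhancement of step S1; st: MAE of direct
    # cross-domain use; q: current MAE. Returns True if adaptation should continue.
    c_rate = (st - q) / max(st - nd, 1e-8)
    return c_rate < threshold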
Through the above steps, the trained model is tested and applied in the actual target scene. Video data are analyzed via key frames sampled at equal intervals, while static images are analyzed directly; if the counted number of people exceeds 100, an early warning is raised automatically and sent to the relevant staff.
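At deployment this reduces to a simple sampling loop; the frame interval and the alert hook below are illustrative assumptions, while the 100-person threshold comes from the text.

def monitor(frames, count_model, interval=25, alert_threshold=100):
    # Analyse every `interval`-th frame; a static image is just a 1-frame stream.
    for idx, frame in enumerate(frames):
        if idx % interval != 0:
            continue
        count = count_model(frame).sum().item()   # integrate the density map
        if count > alert_threshold:
            print(f"frame {idx}: estimated {count:.0f} people - sending alert")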
Example two
As shown in fig. 7, an embodiment of the present invention provides a cross-domain adaptive people counting system based on transfer learning and scene perception, comprising the following modules:
a data enhancement module, configured to form sample pairs from source-domain and target-domain images, train the data enhancement network based on style-transfer learning, and apply the four constraint metrics (style similarity L_SSC, content similarity L_CTC, cycle reconstruction consistency L_CYC, and statistical similarity L_CSC) to the output images to obtain source-domain and target-domain fake images;
a crowd density estimation module, configured to input the source-domain and target-domain fake images into the scene perception classifier for classification to obtain the corresponding scene perception weights, and then fuse the resulting density feature maps with the scene perception weights through the multi-branch density estimator to predict the crowd density;
and an adaptive adjustment module, configured to adaptively adjust the data enhancement network based on style-transfer learning and the multi-branch density estimator according to the conversion-rate index.
The above examples are provided only for the purpose of describing the present invention, and are not intended to limit the scope of the present invention. The scope of the invention is defined by the appended claims. Various equivalent substitutions and modifications can be made without departing from the spirit and principles of the invention, and are intended to be within the scope of the invention.

Claims (6)

1. A cross-domain adaptive people counting method based on transfer learning and scene perception is characterized by comprising the following steps:
step S1: using sample pairs composed of a source-domain image and a target-domain image to train a data enhancement network based on style-transfer learning, and applying four constraint metrics (style similarity loss L_SSC, content similarity loss L_CTC, cycle reconstruction consistency loss L_CYC, and statistical similarity loss L_CSC) to the output images to obtain source-domain and target-domain fake images;
step S2: inputting the source-domain and target-domain fake images into a scene perception classifier for classification to obtain the corresponding scene perception weights; then fusing the resulting density feature maps with the scene perception weights through a multi-branch density estimator to predict the crowd density;
step S3: adaptively adjusting the data enhancement network based on style-transfer learning and the multi-branch density estimator according to a conversion-rate index.
2. The cross-domain adaptive people counting method based on transfer learning and scene perception according to claim 1, wherein step S1, using sample pairs composed of a source-domain image and a target-domain image to train the data enhancement network based on style-transfer learning and applying the four constraint metrics (style similarity loss L_SSC, content similarity loss L_CTC, cycle reconstruction consistency loss L_CYC, and statistical similarity loss L_CSC) to the output images to obtain source-domain and target-domain fake images, specifically comprises:
step S11: using the source-domain images S = {s_1, s_2, …, s_n} with labeled density maps and the unlabeled target-domain images T = {t_1, t_2, …, t_n} to form training sample pairs <s_i, t_j>, where s_i is the i-th source-domain image and t_j is the j-th target-domain image;
step S12: training the data enhancement network based on style-transfer learning on the training sample pairs <s_i, t_j>, outputting the generated images g_{Si→Tj} and g_{Tj→Si}, and applying the four constraint metrics (L_SSC, L_CTC, L_CYC, and L_CSC) to obtain the trained data enhancement network based on style-transfer learning.
3. The cross-domain adaptive people counting method based on transfer learning and scene perception according to claim 2, wherein step S12, training the data enhancement network based on style-transfer learning on the training sample pairs <s_i, t_j>, outputting the generated images g_{Si→Tj} and g_{Tj→Si}, and applying the four constraint metrics (L_SSC, L_CTC, L_CYC, and L_CSC) to obtain the trained data enhancement network based on style-transfer learning, specifically comprises:
step S121: inputting the training sample pair <s_i, t_j> into the data enhancement network based on style-transfer learning and outputting the generated images g_{Si→Tj} and g_{Tj→Si}; computing the bidirectional style similarity loss L_SSC between each generated image and the corresponding style image via formulas (1) to (3), constraining the image g_{Si→Tj} to have the visual style of the target-domain image t_j and, at the same time, the image g_{Tj→Si} to have the visual style of the source-domain image s_i;
L_SSC^{S→T} = (1/(w·h)) · ‖Gram(g_{Si→Tj}) − Gram(t_j)‖²   (1)
L_SSC^{T→S} = (1/(w·h)) · ‖Gram(g_{Tj→Si}) − Gram(s_i)‖²   (2)
L_SSC = L_SSC^{S→T} + L_SSC^{T→S}   (3)
where Gram(·) denotes the Gram matrix, used to extract image scene-style information; w and h denote the width and height of the image; L_SSC^{S→T} denotes the loss from the source domain to the target domain and L_SSC^{T→S} the loss from the target domain to the source domain; the two are computed in the same way but in opposite directions, and together they form the bidirectional loss L_SSC;
Step S122: computing the bidirectional content similarity loss L_CTC between each generated image and its content source, constraining the generated image g_{Si→Tj} to keep the visual content of the source-domain image s_i and the generated image g_{Tj→Si} to keep the visual content of the target image t_j;
step S123: computing the cycle reconstruction consistency loss L_CYC using formula (4), constraining the image g_{Si→Tj} generated from the source-domain image s_i to be cyclically mapped back to s_i and the image g_{Tj→Si} generated from the target image t_j to be cyclically mapped back to t_j;
L_CYC = ‖G_T(G_S(s_i)) − s_i‖_1 + ‖G_S(G_T(t_j)) − t_j‖_1   (4)
where G_S and G_T respectively denote the generators constructed for the cycle-consistency transfer network, taken here as the source-to-target and target-to-source mappings; L_CYC consists of two reciprocal parts, the first term constraining the cycle reconstruction consistency of the source-domain image and the second that of the target-domain image;
step S124: computing the statistical similarity loss L_CSC using formulas (5) to (7), so that the generated target-domain image can use the annotation information of the source-domain image with the same content;
L_CSC^{S→T} = ‖P(g_{Si→Tj}) − P(s_i)‖_1   (5)
L_CSC^{T→S} = ‖P(g_{Tj→Si}) − P(t_j)‖_1   (6)
L_CSC = L_CSC^{S→T} + L_CSC^{T→S}   (7)
where P(·) denotes the density feature map produced by passing a crowd scene through the crowd density predictor;
step S125: balancing the influence of the four constraints on style and content extraction during training using formula (8), obtaining the source-domain and target-domain fake images;
L* = α_1·L_CYC + α_2·L_CSC + α_3·L_CTC + α_4·L_SSC   (8)
where {α_1, α_2, α_3, α_4} are hyperparameters.
4. The cross-domain adaptive people counting method based on transfer learning and scene perception according to claim 1, wherein step S2, inputting the source-domain and target-domain fake images into the scene perception classifier for classification to obtain the corresponding scene perception weights and fusing the resulting density feature maps with the scene perception weights through the multi-branch density estimator to predict the crowd density, specifically comprises:
step S21: inputting the source-domain and target-domain fake images into the scene perception classifier for classification, obtaining a scene class and the corresponding scene perception weights P = {p_1, p_2, p_3}; the scene classes are: dense scenes, sparse scenes, and medium-density scenes;
step S22: inputting the classified source-domain and target-domain fake images into the multi-branch density estimator and selecting the corresponding branch for estimation according to each image's scene class, obtaining the corresponding density feature map; the multi-branch density estimator comprises:
the first branch, a density estimator for dense scenes, which produces the dense-scene density feature map;
the second branch, a density estimator for sparse scenes, which produces the sparse-scene density feature map;
and the third branch, a density estimator for medium-density scenes, which produces the medium-density-scene density feature map;
step S23: fusing the density feature maps with the corresponding scene perception weights and predicting the crowd density using formula (9);
I_Final = Σ_c p_c ⊙ E_c   (9)
where ⊙ denotes the element-wise product, p_c the probability that a sample belongs to scene class c, E_c the feature map predicted by the c-th branch, and I_Final the final fused estimated crowd density map.
5. The cross-domain adaptive people counting method based on transfer learning and scene perception according to claim 1, wherein the conversion-rate index C_rate (whose formula is given as an image in the original) is computed from nd, the MAE of the model without the data enhancement of step S1, st, the MAE when the model is used directly in the cross-domain real scene, and Q, the current MAE of the model under analysis;
when C_rate does not reach a preset threshold, target-domain images are added progressively in proportion to form new sample pairs, the data enhancement network based on style-transfer learning is further trained, and the learning rates of the different branches in the multi-branch density estimator are adjusted once the data enhancement network plateaus.
6. A cross-domain adaptive people counting system based on transfer learning and scene perception is characterized by comprising the following modules:
a data enhancement module, configured to form sample pairs from source-domain and target-domain images, train the data enhancement network based on style-transfer learning, and apply the four constraint metrics (style similarity L_SSC, content similarity L_CTC, cycle reconstruction consistency L_CYC, and statistical similarity L_CSC) to the output images to obtain source-domain and target-domain fake images;
a crowd density estimation module, configured to input the source-domain and target-domain fake images into the scene perception classifier for classification to obtain the corresponding scene perception weights, and then fuse the resulting density feature maps with the scene perception weights through the multi-branch density estimator to predict the crowd density;
and an adaptive adjustment module, configured to adaptively adjust the data enhancement network based on style-transfer learning and the multi-branch density estimator according to the conversion-rate index.
CN202110418583.XA 2021-04-19 2021-04-19 Cross-domain self-adaptive people counting method based on transfer learning and scene perception Pending CN113095246A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110418583.XA CN113095246A (en) 2021-04-19 2021-04-19 Cross-domain self-adaptive people counting method based on transfer learning and scene perception


Publications (1)

Publication Number Publication Date
CN113095246A (en) 2021-07-09

Family

ID=76678512

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110418583.XA Pending CN113095246A (en) 2021-04-19 2021-04-19 Cross-domain self-adaptive people counting method based on transfer learning and scene perception

Country Status (1)

Country Link
CN (1) CN113095246A (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111783610A (en) * 2020-06-23 2020-10-16 西北工业大学 Cross-domain crowd counting method based on de-entangled image migration
CN112131967A (en) * 2020-09-01 2020-12-25 河海大学 Remote sensing scene classification method based on multi-classifier anti-transfer learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
NA JIANG et al.: "DAPC: Domain Adaptation People Counting via Style-level Transfer Learning and Scene-aware Estimation", 2020 25th International Conference on Pattern Recognition (ICPR) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113642403A (en) * 2021-07-13 2021-11-12 重庆科技学院 Crowd abnormal intelligent safety detection system based on edge calculation
CN113837191A (en) * 2021-08-30 2021-12-24 浙江大学 Cross-satellite remote sensing image semantic segmentation method based on bidirectional unsupervised domain adaptive fusion
CN113837191B (en) * 2021-08-30 2023-11-07 浙江大学 Cross-star remote sensing image semantic segmentation method based on bidirectional unsupervised domain adaptive fusion
CN114707402A (en) * 2022-03-09 2022-07-05 中国石油大学(华东) Method for converting curling simulation image into real image by reinforcement learning perception

Similar Documents

Publication Publication Date Title
Patrikar et al. Anomaly detection using edge computing in video surveillance system
Fan et al. A survey of crowd counting and density estimation based on convolutional neural network
Javan Roshtkhari et al. Online dominant and anomalous behavior detection in videos
CN108921051B (en) Pedestrian attribute identification network and technology based on cyclic neural network attention model
CN113095246A (en) Cross-domain self-adaptive people counting method based on transfer learning and scene perception
CN109508360B (en) Geographical multivariate stream data space-time autocorrelation analysis method based on cellular automaton
CN111723693B (en) Crowd counting method based on small sample learning
CN113536972B (en) Self-supervision cross-domain crowd counting method based on target domain pseudo label
WO2023207742A1 (en) Method and system for detecting anomalous traffic behavior
Asad et al. Anomaly3D: Video anomaly detection based on 3D-normality clusters
Hu et al. Parallel spatial-temporal convolutional neural networks for anomaly detection and location in crowded scenes
CN112115849A (en) Video scene identification method based on multi-granularity video information and attention mechanism
Wang et al. Crowdmlp: Weakly-supervised crowd counting via multi-granularity mlp
Duan et al. Sofa-net: Second-order and first-order attention network for crowd counting
CN112819063A (en) Image identification method based on improved Focal loss function
Pang et al. Federated learning for crowd counting in smart surveillance systems
Qureshi et al. Neurocomputing for internet of things: object recognition and detection strategy
Hasan et al. Estimating traffic density on roads using convolutional neural network with batch normalization
Zhang et al. A spatiotemporal graph wavelet neural network for traffic flow prediction
CN116503776A (en) Time-adaptive-based space-time attention video behavior recognition method
Ren et al. Student behavior detection based on YOLOv4-Bi
Jebur et al. Abnormal Behavior Detection in Video Surveillance Using Inception-v3 Transfer Learning Approaches
CN112926517B (en) Artificial intelligence monitoring method
Annamalai et al. EvAn: Neuromorphic event-based sparse anomaly detection
Vu et al. Anomaly detection in surveillance videos by future appearance-motion prediction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210709