CN113095246A - Cross-domain self-adaptive people counting method based on transfer learning and scene perception - Google Patents

Cross-domain self-adaptive people counting method based on transfer learning and scene perception

Info

Publication number
CN113095246A
CN113095246A
Authority
CN
China
Prior art keywords
image
domain
scene
density
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110418583.XA
Other languages
Chinese (zh)
Inventor
姜那
温兴森
许鹏
施智平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Capital Normal University
Original Assignee
Capital Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Capital Normal University filed Critical Capital Normal University
Priority to CN202110418583.XA priority Critical patent/CN113095246A/en
Publication of CN113095246A publication Critical patent/CN113095246A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53Recognition of crowd images, e.g. recognition of crowd congestion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a cross-domain adaptive people counting method and system based on transfer learning and scene perception, wherein the method comprises the following steps: S1: training a data enhancement network based on style-transfer learning with sample pairs composed of a source-domain image and a target-domain image, and applying four constraint metrics to the output images to obtain source-domain and target-domain fake images; S2: inputting the real source-domain images and the fake target-domain images into a scene perception classifier and a multi-branch density estimator to predict crowd density; S3: adaptively adjusting the data enhancement network and the multi-branch density estimator according to a conversion-rate index. The method builds low-level data associations through data enhancement and a high-level knowledge bridge through joint training; the probabilities with which the scene classifier predicts an image to belong to different scenes provide perception weights, and highly adaptive crowd density estimation is achieved by weighted fusion, ensuring the generalization and adaptability of the model.

Description

Cross-domain self-adaptive people counting method based on transfer learning and scene perception
Technical Field
The invention relates to the field of computer vision, in particular to a cross-domain self-adaptive people counting method and system based on transfer learning and scene perception.
Background
People counting is a fundamental research problem in computer vision and a key task in intelligent analysis of surveillance video, with particular importance for applications such as traffic control, anomaly early warning, and flow analysis. It focuses on automatically estimating crowd density in real surveillance videos or images using intelligent algorithms. As a problem of clear research value, it has attracted wide attention from industry and academia, and many innovative methods have been proposed, greatly promoting its rapid development. For example, Wang et al. first introduced an end-to-end people counting framework built on convolutional structures, and Zhang et al. designed a multi-column convolutional structure. These methods firmly established the effectiveness of deep learning for people counting. On this basis, Liu et al. proposed DecideNet, which contains detection and regression branches; Li et al. proposed CSRNet, which focuses on high-density scenes; and Sindagi et al. proposed a network fusing top-down and bottom-up features. Through multi-task, multi-scale, and similar designs, people counting accuracy on closed, finite datasets has been further improved.
However, because training sets have limited coverage, real-world data is diverse, and labeling is difficult, current people counting techniques and platforms face insufficient generalization when applied to open, real surveillance scenes. This means that in an open real environment, the algorithmic model must be continuously improved to adapt to cross-domain data and multiple scene types.
Indeed, domain adaptation has been explored in many other vision tasks, such as segmentation, detection, and object re-identification. For example, Sankaranarayanan et al. employed Generative Adversarial Networks (GANs) to mitigate the data distribution gap between source and target domains. Inoue et al. proposed a weakly supervised training method to achieve domain adaptation. Deng et al. and Zhong et al. introduced transfer learning. The principle underlying all these methods is to build a knowledge-transfer bridge between the source and target domains through transfer learning, thereby fundamentally improving model generalization across tasks. These innovations and contributions are inseparable from the rapid development of GANs. In 2019, Wang et al. and Gao et al. successively proposed SeCycleGAN and DACC for domain-adaptive people counting; by jointly training on synthetic enhanced samples and original data, they transferred density estimation knowledge obtained in an actively labeled domain and achieved notable gains in generalization. However, these works ignore a key problem: synthetic images differ from real surveillance images in illumination, scene, and other respects, which easily traps the scene-sensitive people counting task in local optima. Worse, the gap between synthetic and real data is sometimes larger than the gap between real data from different cameras, and the storage and generation costs of synthetic data are extremely high.
Therefore, how to take real closed-set data as the source domain and effectively establish data associations between it and the actual open target domain, so as to realize knowledge transfer that benefits people counting, has become an urgent problem.
Disclosure of Invention
In order to solve the technical problems, the invention provides a cross-domain adaptive people counting method and system based on transfer learning and scene perception.
The technical solution of the invention is as follows: a cross-domain adaptive people counting method based on transfer learning and scene perception comprises the following steps:
step S1: using sample pairs composed of a source-domain image and a target-domain image to train a data enhancement network based on style-transfer learning, and applying four constraint metrics (style similarity loss L_SSC, content similarity loss L_CTC, cycle reconstruction consistency loss L_CYC, and statistical similarity loss L_CSC) to the output images to obtain source-domain and target-domain fake images;
step S2: inputting the source-domain and target-domain fake images into a scene perception classifier for classification to obtain the corresponding scene perception weights; then fusing the resulting density feature maps with the scene perception weights through a multi-branch density estimator to predict the crowd density;
step S3: adaptively adjusting the data enhancement network based on style-transfer learning and the multi-branch density estimator according to a conversion-rate index.
Compared with the prior art, the invention has the following advantages:
1. The proposed method introduces style-transfer learning into data enhancement and designs four constraints (style similarity L_SSC, content similarity L_CTC, cycle reconstruction consistency L_CYC, and statistical similarity L_CSC) that strengthen the data association between the source and target domains. Unlike conventional methods that dynamically estimate a density map for each generated image, it reuses the density map of the source-domain image; this provides a medium for knowledge transfer and at the same time effectively reduces the error accumulated by dynamically estimating density maps for generated samples.
2. The proposed method introduces the concept of scene-aware estimation into crowd density estimation: perception weights are obtained by classifying sample scenes, which avoids the negative influence on parameter learning caused by the alternation of samples from different scenes during training and lets every sample play its full role. Meanwhile, the multi-branch design uses different structures: after shared convolutions separate foreground from background information, each branch strengthens its own attention pattern for counting people in a particular scene type by enlarging the receptive field or exploiting context, and weighted fusion makes the branches' performance complementary. This is especially effective for images whose crowd density changes markedly along the vertical direction due to a pronounced perspective relationship.
3. When crowd density estimation runs in a real scene, the current network must be updated continuously to keep the model optimally adaptive and generalizable. The conversion rate in the proposed method dynamically evaluates whether the current model's data enhancement and scene perception have saturated; if there is still room for improvement, the model is adaptively fine-tuned. The result is an intelligent people counting model best suited to the current surveillance scene, which rapidly predicts and analyzes the number of people in surveillance videos/images, provides necessary early warnings to staff, and effectively prevents safety hazards such as stampedes, illegal gatherings, and road congestion.
Drawings
FIG. 1 is a flowchart of a cross-domain adaptive people counting method based on transfer learning and scene perception according to an embodiment of the present invention;
fig. 2 is a flowchart of step S1 of the cross-domain adaptive people counting method based on transfer learning and scene perception in an embodiment of the present invention: using sample pairs composed of a source-domain image and a target-domain image to train a data enhancement network based on style-transfer learning, and applying the four constraint metrics (style similarity loss L_SSC, content similarity loss L_CTC, cycle reconstruction consistency loss L_CYC, and statistical similarity loss L_CSC) to the output images to obtain source-domain and target-domain fake images;
fig. 3 is a flowchart of step S12 of the method in an embodiment of the present invention: training the data enhancement network based on style-transfer learning on the training sample pair <s_i, t_j>, outputting the generated images g_{Si→Tj} and g_{Tj→Si}, and applying the four constraint metrics (L_SSC, L_CTC, L_CYC, and L_CSC) to obtain the trained data enhancement network based on style-transfer learning;
FIG. 4 is a schematic structural diagram of the data enhancement network based on style-transfer learning according to an embodiment of the present invention;
fig. 5 is a flowchart of step S2 of the method in an embodiment of the present invention: inputting the source-domain and target-domain fake images into a scene perception classifier for classification to obtain the corresponding scene perception weights, and then fusing the resulting density feature maps with the scene perception weights through a multi-branch density estimator to predict the crowd density;
FIG. 6 is a schematic structural diagram of a training scene classifier and a multi-branch density estimator according to an embodiment of the present invention;
fig. 7 is a block diagram illustrating a structure of a cross-domain adaptive people counting system based on transfer learning and scene awareness in an embodiment of the present invention.
Detailed Description
The invention provides a cross-domain adaptive people counting method based on transfer learning and scene perception. It effectively establishes data associations between a closed source domain and the actual open target domain, uses the probabilities with which the scene classifier predicts an image to belong to different scenes as perception weights, and achieves highly adaptive crowd density estimation by weighted fusion.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings.
Example one
As shown in fig. 1, an embodiment of the present invention provides a cross-domain adaptive people counting method based on transfer learning and scene perception, including the following steps:
step S1: using sample pairs composed of a source-domain image and a target-domain image to train a data enhancement network based on style-transfer learning, and applying four constraint metrics (style similarity loss L_SSC, content similarity loss L_CTC, cycle reconstruction consistency loss L_CYC, and statistical similarity loss L_CSC) to the output images to obtain source-domain and target-domain fake images;
in the step, a false image is generated by learning the source domain content and the target domain style, and in the process, the style similarity, the content similarity, the cyclic reconstruction consistency and the statistical similarity of people number are utilized to restrict and reduce a possible mapping space and keep an effective knowledge migration bridge.
Step S2: inputting the source-domain and target-domain fake images into the scene perception classifier for classification to obtain the corresponding scene perception weights; then fusing the resulting density feature maps with the scene perception weights through the multi-branch density estimator to predict the crowd density;
the scene perceptron classifier in the step predicts the probability that the input sample belongs to different scene classifications and takes the probability as a perception weight, and then the perception weight and the density characteristic diagram extracted by the corresponding branch are fused through the multi-branch density estimator to predict the crowd density, thereby realizing the complementation of the people counting results in different observation modes and ensuring that the method provided by the invention can adapt to the real monitoring environment with rich scene types.
Step S3: adaptively adjusting the data enhancement network based on style-transfer learning and the multi-branch density estimator according to the conversion-rate index.
As shown in fig. 2, in one embodiment, step S1, using sample pairs composed of a source-domain image and a target-domain image to train a data enhancement network based on style-transfer learning and applying the four constraint metrics (style similarity loss L_SSC, content similarity loss L_CTC, cycle reconstruction consistency loss L_CYC, and statistical similarity loss L_CSC) to the output images to obtain source-domain and target-domain fake images, specifically includes:
step S11: using the source-domain images S = {s_1, s_2, …, s_n} with labeled density maps and the unlabeled target-domain images T = {t_1, t_2, …, t_n} to form training sample pairs <s_i, t_j>, where s_i is the i-th source-domain image and t_j is the j-th target-domain image;
In this step, the source-domain data with density labels serve as the original images S = {s_1, s_2, …, s_n}, and an equal number of target-domain images T = {t_1, t_2, …, t_n} are randomly selected as target images. Considering GPU computing performance and work efficiency, this embodiment of the invention resizes all data to 512×512 and forms the source-to-target input pairs <s_i, t_j>.
Step S12: for training sample pairs<si,tj>Training the data enhancement network based on style migration learning, and outputting a generated image gSi-TjAnd gTj-SiAnd subjecting it to style similarity lossLSSCContent similarity loss LCTCCyclic reestablishment of consistency loss LCYCAnd statistical similarity loss LCSCAnd obtaining a trained data enhancement network based on the style migration learning by four constraint metrics.
This embodiment of the invention improves on the GAN structure to construct the data enhancement network based on style-transfer learning. The network consists of two generators and two discriminators. The generators share the same structure but not parameters; each is composed of two convolutions with stride 2, ten residual modules, and two deconvolutions with stride 1/2 (i.e., 2× upsampling). To include more crowd and background detail in the generated fake images, a self-attention model is also introduced into the residual modules. The discriminators use a binary classification loss to judge whether an image is real or fake and to update their parameters, and they introduce spectral normalization for extracting high-level semantic features, which is very effective for dense scenes or scenes with a pronounced perspective relationship. Meanwhile, the invention designs four constraints (style similarity loss L_SSC, content similarity loss L_CTC, cycle reconstruction consistency loss L_CYC, and statistical similarity loss L_CSC) to constrain the generated images.
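A minimal PyTorch sketch of this generator and discriminator layout is given below. It is an illustration rather than the patent's exact network: the channel widths, the placement of self-attention in only the last residual module, and the PatchGAN-style discriminator head are assumptions, and the "deconvolutions with stride 1/2" are realized as stride-2 transposed convolutions.

import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    # Lightweight SAGAN-style spatial self-attention used inside residual modules.
    def __init__(self, ch):
        super().__init__()
        self.q = nn.Conv2d(ch, ch // 8, 1)
        self.k = nn.Conv2d(ch, ch // 8, 1)
        self.v = nn.Conv2d(ch, ch, 1)
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)     # (B, HW, C//8)
        k = self.k(x).flatten(2)                     # (B, C//8, HW)
        attn = torch.softmax(q @ k, dim=-1)          # (B, HW, HW)
        v = self.v(x).flatten(2)                     # (B, C, HW)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return x + self.gamma * out

class ResBlock(nn.Module):
    def __init__(self, ch, use_attention=False):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch))
        self.attn = SelfAttention(ch) if use_attention else nn.Identity()

    def forward(self, x):
        return self.attn(x + self.body(x))

class Generator(nn.Module):
    # Two stride-2 convolutions, ten residual modules, two 2x upsampling layers.
    def __init__(self, n_blocks=10):
        super().__init__()
        layers = [nn.Conv2d(3, 64, 7, padding=3), nn.ReLU(True),
                  nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(True),
                  nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(True)]
        layers += [ResBlock(256, use_attention=(i == n_blocks - 1))
                   for i in range(n_blocks)]
        layers += [nn.ConvTranspose2d(256, 128, 3, stride=2, padding=1, output_padding=1),
                   nn.ReLU(True),
                   nn.ConvTranspose2d(128, 64, 3, stride=2, padding=1, output_padding=1),
                   nn.ReLU(True),
                   nn.Conv2d(64, 3, 7, padding=3), nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

def make_discriminator():
    # Binary real/fake critic with spectral normalization on every convolution.
    sn = nn.utils.spectral_norm
    return nn.Sequential(
        sn(nn.Conv2d(3, 64, 4, 2, 1)), nn.LeakyReLU(0.2, True),
        sn(nn.Conv2d(64, 128, 4, 2, 1)), nn.LeakyReLU(0.2, True),
        sn(nn.Conv2d(128, 1, 4, 1, 1)))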
As shown in fig. 3, in one embodiment, step S12, training the data enhancement network based on style-transfer learning on the training sample pairs <s_i, t_j>, outputting the generated images g_{Si→Tj} and g_{Tj→Si}, and applying the four constraint metrics (L_SSC, L_CTC, L_CYC, and L_CSC) to obtain the trained data enhancement network, specifically includes:
step S121: inputting the training sample pair <s_i, t_j> into the data enhancement network based on style-transfer learning and outputting the generated images g_{Si→Tj} and g_{Tj→Si}; computing the bidirectional style similarity loss L_SSC between each generated image and the corresponding style image via formulas (1) to (3), constraining the image g_{Si→Tj} to have the visual style of the target-domain image t_j and, at the same time, the image g_{Tj→Si} to have the visual style of the source-domain image s_i;
L_SSC^{S→T} = (1/(w·h)) · ‖Gram(g_{Si→Tj}) − Gram(t_j)‖²   (1)
L_SSC^{T→S} = (1/(w·h)) · ‖Gram(g_{Tj→Si}) − Gram(s_i)‖²   (2)
L_SSC = L_SSC^{S→T} + L_SSC^{T→S}   (3)
where Gram(·) denotes the Gram matrix, used to extract image scene-style information; w and h denote the width and height of the image; L_SSC^{S→T} denotes the loss from the source domain to the target domain and L_SSC^{T→S} the loss from the target domain to the source domain. The two are computed in the same way but in opposite directions, and together they form the bidirectional loss L_SSC.
In this step, by computing the style difference, expressed through Gram matrices, between the style target image and the generated image, the style similarity loss L_SSC supervises the reciprocal source-to-target and target-to-source training, so that the target style is transferred onto the generated image.
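For concreteness, the Gram-matrix style loss of formulas (1) to (3) can be sketched as follows; applying it directly to image tensors is a simplification (style losses are often computed on deep feature maps instead).

import torch

def gram(x):
    # Gram matrix of a (B, C, H, W) tensor, normalised by spatial size w*h.
    b, c, h, w = x.shape
    f = x.flatten(2)                          # (B, C, H*W)
    return f @ f.transpose(1, 2) / (h * w)

def l_ssc(g_s2t, t_j, g_t2s, s_i):
    # Bidirectional style loss: source-to-target plus target-to-source, formula (3).
    s2t = torch.mean((gram(g_s2t) - gram(t_j)) ** 2)
    t2s = torch.mean((gram(g_t2s) - gram(s_i)) ** 2)
    return s2t + t2s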
Step S122: content similarity loss L between generated image and source domain image by bi-directional computationCTCConstraining the generated image gSi-TjWith source-domain image siThe visual content of (a); at the same time, the image g is constrained to be generatedTj-SiWith a target image tjThe visual content of (a);
In this step, so that a generated image carries the image content of its source-domain counterpart and the subsequent multi-branch density estimator can therefore be supervised with the same labeled density map, the content similarity loss L_CTC is computed bidirectionally between the generated images and their content sources. This preserves the crowd distribution content of the source-domain image, constraining g_{Si→Tj} to have the visual content of s_i and g_{Tj→Si} to have the visual content of t_j. The formula for computing L_CTC is identical to the MSE loss function.
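Since the text states that L_CTC coincides with the MSE loss, a sketch needs only one term per direction; pairing each generated image with its content source follows the constraint described above.

import torch.nn.functional as F

def l_ctc(g_s2t, s_i, g_t2s, t_j):
    # Generated images keep the visual content of their content sources.
    return F.mse_loss(g_s2t, s_i) + F.mse_loss(g_t2s, t_j)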
Step S123: the cycle reconstruction consistency loss L is calculated using the following calculation equation (4)CYCBeam source field image siGenerated image gSi-TjCan be circularly changed back to the source domain image si(ii) a At the same time, the target image t is constrainedjGenerated image gTj-SiCyclically changing back to the target image tj
L_CYC = ‖G_T(G_S(s_i)) − s_i‖_1 + ‖G_S(G_T(t_j)) − t_j‖_1   (4)
where G_S and G_T denote the generators of the cycle-consistency transfer network, taken here as the source-to-target and target-to-source mappings respectively; L_CYC consists of two reciprocal parts, the first term constraining the cycle reconstruction consistency of the source-domain image and the second that of the target-domain image.
Because learning based on the adversarial-game idea happens only in a high-dimensional feature space, many mutual mappings remain possible. To reduce this redundant mapping space and bias the conversion toward the people counting task, supervision by the cycle reconstruction consistency loss L_CYC is required; it is realized by computing the bidirectional L1 distance via formula (4). After a source-domain image is translated, it is reconstructed back to the source domain through a network with the same structure but different parameters, and a target-domain image is likewise reconstructed back to the target domain. This shrinks the space of possible knowledge transfers in the high-dimensional feature space and approaches the direction most beneficial to people counting.
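A sketch of formula (4) follows; the convention that G_S maps source-style images toward the target style and G_T the reverse is an assumption, since the patent only names the two generators.

import torch.nn.functional as F

def l_cyc(G_S, G_T, s_i, t_j):
    # s -> fake target -> reconstructed s, and t -> fake source -> reconstructed t,
    # each penalised with an L1 distance.
    return F.l1_loss(G_T(G_S(s_i)), s_i) + F.l1_loss(G_S(G_T(t_j)), t_j)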
Step S124: the statistical similarity loss L is calculated using the following equations (5) to (7)CSCSo that the target domain generation image can use the annotation information of the same content in the source domain image;
L_CSC^{S→T} = ‖P(g_{Si→Tj}) − P(s_i)‖_1   (5)
L_CSC^{T→S} = ‖P(g_{Tj→Si}) − P(t_j)‖_1   (6)
L_CSC = L_CSC^{S→T} + L_CSC^{T→S}   (7)
where P () represents the density profile of a dense scene passing through the crowd density predictor.
The data enhancement network based on style-transfer learning therefore has not only style and content constraints but also ensures that a data-enhanced image can use the annotation information of its source-domain content image. On top of the three constraints above, a multi-branch density estimator is introduced into the network to evaluate the people statistics of the generated and source-domain images, upgrading the three-channel image-space measurement to the single-channel density-map space. This guarantees that a generated image has target-domain style and source-domain content while remaining usable with the annotations of its content image.
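A sketch of the statistical similarity constraint, with P standing for the crowd density predictor that maps an image to its single-channel density map; pairing each generated image with its content source and using an L1 comparison are assumptions.

import torch.nn.functional as F

def l_csc(P, g_s2t, s_i, g_t2s, t_j):
    # Compare predicted density maps so a generated image keeps the people
    # statistics of its content source (formulas (5) to (7)).
    return F.l1_loss(P(g_s2t), P(s_i)) + F.l1_loss(P(g_t2s), P(t_j))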
Step S125: utilizing the following formula (8) to balance and control the influence of the four constraints on style and content extraction training to obtain false images of a source domain and a target domain;
L* = α_1·L_CYC + α_2·L_CSC + α_3·L_CTC + α_4·L_SSC   (8)
where {α_1, α_2, α_3, α_4} are hyperparameters.
To fully exploit the supervision effects of the different loss functions, this embodiment trains with all of them jointly, designs the loss weights empirically, and computes the total loss as a weighted sum. The weights for the style similarity loss L_SSC, content similarity loss L_CTC, cycle reconstruction consistency loss L_CYC, and statistical similarity loss L_CSC are 2, 0.5, 0.1, and 0.01 respectively; their magnitudes reflect the influence of the different losses on the generated image.
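With these weights, the joint objective of formula (8) reduces to a weighted sum:

def total_loss(ssc, ctc, cyc, csc):
    # alpha_1..alpha_4 from the text: cycle 0.1, statistics 0.01, content 0.5, style 2.
    return 0.1 * cyc + 0.01 * csc + 0.5 * ctc + 2.0 * ssc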
FIG. 4 is a schematic diagram of the data enhancement network based on style-transfer learning. The four designed constraints (style similarity L_SSC, content similarity L_CTC, cycle reconstruction consistency L_CYC, and statistical similarity L_CSC) strengthen the data association between the source and target domains. Unlike conventional methods that dynamically estimate a density map for each generated image, the network reuses the density map of the source-domain image, providing a medium for knowledge transfer while effectively reducing the error accumulated by dynamically estimated sample density maps.
As shown in fig. 5, in one embodiment, step S2, inputting the source-domain and target-domain fake images into the scene perception classifier for classification to obtain the corresponding scene perception weights and then fusing the resulting density feature maps with the scene perception weights through the multi-branch density estimator to predict the crowd density, specifically includes:
step S21: inputting the source-domain and target-domain fake images into the scene perception classifier for classification, obtaining a scene class and the corresponding scene perception weights P = {p_1, p_2, p_3}; the scene classes are: dense scenes, sparse scenes, and medium-density scenes;
to train the scene classifier, 3,000 images were manually labeled, each type of data containing 1,000 images. The annotation is based on the number of people in the scene, the scene with more than 100 people is divided into dense scenes, the scenes with less than 30 people are summarized into sparse scenes, and the image between the dense scenes and the sparse scenes is the medium-density scene. The scene classifier can be trained by utilizing the labeled data, and is mainly used for learning the perception weight P ═ P1,p2,p3Each of the weights reflects the impact of a different sample on updating the network branches, respectively. The scene with the probability of less than 30 persons is regarded as a sparse scene, the scene with the probability of more than 100 persons is regarded as a dense scene, the scenes with the number of other persons are regarded as a medium-density scene, and the specific training processes of the three scenes need to be independently completed.
Step S22: inputting the classified source domain and target domain false images into a multi-branch density estimator, and selecting corresponding branches for estimation according to the classification of the source domain and target domain false images to obtain corresponding density characteristic graphs; wherein the multi-branch density estimator comprises:
the first branch is a density estimator of the dense scene to obtain a density characteristic diagram of the dense scene;
the second branch is a density estimator of the sparse scene to obtain a density characteristic diagram of the sparse scene;
and the third branch is a density estimator of the medium-density scene, and a density characteristic diagram of the medium-density scene is obtained.
The classified source-domain and target-domain fake images then enter the multi-branch density estimator. Considering that counting people in scenes of different densities calls for different patterns of attention, the invention designs three branches with different structures.
The first branch is the density estimator for dense scenes and produces the dense-scene density feature map; this embodiment mainly uses convolutions with a dilation rate of 2.
The second branch is the density estimator for sparse scenes and produces the sparse-scene density feature map; this embodiment uses convolutions with a dilation rate of 4. Both branches use dilated convolutions to enlarge the receptive field.
The third branch is the density estimator for medium-density scenes and produces the medium-density-scene density feature map; this embodiment introduces a self-attention module to learn the influence of contexts at different distances on crowd density estimation.
The multi-branch density estimator of this embodiment consists of a convolution block with shared parameters and three branches with unshared parameters; the network structure configuration of each branch is shown in Table 1:
table 1: three-branch network structure configuration of multi-branch density estimator
[Table 1 appears as an image in the original publication.]
In Table 1, K denotes the convolution kernel size, S the stride, C the number of channels, D the dilation rate, and SA the self-attention module. The shared convolution block in the branches is responsible for separating foreground from background information; the dense-scene branch uses low-dilation convolutions and the sparse-scene branch high-dilation convolutions, effectively enlarging the receptive field, while the medium-density branch introduces a self-attention module to learn the influence of contexts at different distances on crowd density estimation. Because their structures differ, the three branches attend to different statistics, and for surveillance scenes with perspective change the three attention patterns can complement one another, as sketched below.
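Because Table 1 survives only as an image, the following sketch uses assumed channel widths and depths; it preserves the structural facts stated in the text: a shared convolution block, a dilation-2 dense branch, a dilation-4 sparse branch, and a self-attention medium-density branch, each ending in a one-channel density map.

import torch
import torch.nn as nn

class TinySelfAttention(nn.Module):
    # Minimal spatial self-attention for the medium-density branch
    # (in practice applied on downsampled feature maps to bound memory).
    def __init__(self, ch):
        super().__init__()
        self.q = nn.Conv2d(ch, ch, 1)
        self.k = nn.Conv2d(ch, ch, 1)
        self.v = nn.Conv2d(ch, ch, 1)
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)           # (B, HW, C)
        k = self.k(x).flatten(2)                           # (B, C, HW)
        attn = torch.softmax(q @ k / c ** 0.5, dim=-1)     # (B, HW, HW)
        v = self.v(x).flatten(2)                           # (B, C, HW)
        return x + self.gamma * (v @ attn.transpose(1, 2)).view(b, c, h, w)

def dilated_branch(dilation):
    return nn.Sequential(
        nn.Conv2d(64, 64, 3, padding=dilation, dilation=dilation), nn.ReLU(True),
        nn.Conv2d(64, 64, 3, padding=dilation, dilation=dilation), nn.ReLU(True),
        nn.Conv2d(64, 1, 1))                               # one-channel density map

class MultiBranchDensityEstimator(nn.Module):
    def __init__(self):
        super().__init__()
        self.shared = nn.Sequential(                       # separates fore/background
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(True))
        self.dense = dilated_branch(2)                     # dense scenes, dilation 2
        self.sparse = dilated_branch(4)                    # sparse scenes, dilation 4
        self.medium = nn.Sequential(TinySelfAttention(64), nn.Conv2d(64, 1, 1))

    def forward(self, x):
        f = self.shared(x)
        return [self.dense(f), self.sparse(f), self.medium(f)]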
Step S23: fusing the density characteristic diagram with the corresponding scene perception weight, and realizing the prediction of the crowd density by using the following formula (9);
I_Final = Σ_c p_c ⊙ E_c   (9)
where ⊙ denotes the element-wise product, p_c the probability that a sample belongs to scene class c, E_c the feature map predicted by the c-th branch, and I_Final the final fused estimated crowd density map.
To achieve the attention-pattern complementation mentioned above, the invention fuses the scene perception weights P = {p_1, p_2, p_3}, i.e., the scene classification probabilities, with the density feature maps obtained in step S22 via formula (9). In this way every training sample gives positive feedback to knowledge updating, fundamentally avoiding the influence of scene changes on the training of any single branch.
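Formula (9) then reduces to a probability-weighted sum of the branch outputs; the sketch below assumes the weights are the softmax probabilities produced by the scene classifier.

def fuse_density(weights, density_maps):
    # weights: (B, 3) scene probabilities; density_maps: list of three (B, 1, H, W)
    # maps. I_Final = sum_c p_c * E_c, broadcasting each probability over its map.
    return sum(weights[:, c].view(-1, 1, 1, 1) * m
               for c, m in enumerate(density_maps))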
As shown in the schematic structural diagram of the training scene classifier and multi-branch density estimator in fig. 6, the proposed method introduces scene-aware estimation into crowd density estimation: perception weights obtained by classifying sample scenes avoid the negative influence on parameter learning caused by the alternation of samples from different scenes during training and let every sample play its full role. Meanwhile, the branches use different structures: after shared convolutions separate foreground from background information, each branch strengthens its attention pattern for a particular scene type by enlarging the receptive field or exploiting context, and weighted fusion makes their performance complementary. This is especially effective for images whose crowd density changes markedly along the vertical direction due to a pronounced perspective relationship.
In one embodiment, step S3 adaptively adjusts the data enhancement network based on style-transfer learning and the multi-branch density estimator according to the conversion-rate index C_rate. The formula for C_rate is given as an image in the original; it is computed from nd, the MAE of the model without the data enhancement of step S1, st, the MAE when the model is used directly in the cross-domain real scene, and Q, the current MAE of the model under analysis.
When C_rate does not reach the preset threshold, target-domain images can be added progressively in proportion to form new sample pairs, the data enhancement network based on style-transfer learning is further trained, and the learning rates of the different branches in the multi-branch density estimator are adjusted once the data enhancement network plateaus.
This embodiment of the invention provides the conversion-rate index C_rate for automatically evaluating the current status of data enhancement and scene perception. When C_rate does not reach the preset threshold, target-domain images are added progressively in proportion to form new sample pairs, the data enhancement network based on style-transfer learning is further trained, and once it plateaus the learning rates of the different branches in the multi-branch density estimator are adjusted, strengthening scene classification accuracy and optimizing training.
During model training, when crowd density is estimated on target-domain images, the current network must be updated in real time to keep the model optimally adaptive and generalizable. The proposed conversion rate dynamically evaluates whether the current model's data enhancement and scene perception have saturated; if there is room for improvement, the model is adaptively fine-tuned. This yields an intelligent people counting model best suited to the current surveillance scene, which rapidly predicts and analyzes the number of people in surveillance videos/images, provides necessary early warnings to staff, and effectively prevents safety hazards such as stampedes, illegal gatherings, and road congestion.
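Since the conversion-rate formula itself appears only as an image, the check below is a hypothetical reading, taking C_rate as the fraction of the gap between direct cross-domain use and the enhancement-free baseline that the current model has recovered; only the roles of nd, st, and Q come from the text.

def needs_fine_tuning(nd, st, q, threshold=0.9):
    # nd: MAE without the data enhancement of step S1; st: MAE of direct
    # cross-domain use; q: current MAE. Returns True if adaptation should continue.
    c_rate = (st - q) / max(st - nd, 1e-8)
    return c_rate < threshold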
Through the above steps, the trained model is tested and applied in the actual target scene. Video data are analyzed via key frames sampled at equal intervals, while static images are analyzed directly; if the counted number of people exceeds 100, an early warning is raised automatically and sent to the relevant staff.
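At deployment this reduces to a simple sampling loop; the frame interval and the alert hook below are illustrative assumptions, while the 100-person threshold comes from the text.

def monitor(frames, count_model, interval=25, alert_threshold=100):
    # Analyse every `interval`-th frame; a static image is just a 1-frame stream.
    for idx, frame in enumerate(frames):
        if idx % interval != 0:
            continue
        count = count_model(frame).sum().item()   # integrate the density map
        if count > alert_threshold:
            print(f"frame {idx}: estimated {count:.0f} people - sending alert")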
Example two
As shown in fig. 7, an embodiment of the present invention provides a cross-domain adaptive people counting system based on transfer learning and scene perception, comprising the following modules:
a data enhancement module, configured to form sample pairs from source-domain and target-domain images, train the data enhancement network based on style-transfer learning, and apply the four constraint metrics (style similarity L_SSC, content similarity L_CTC, cycle reconstruction consistency L_CYC, and statistical similarity L_CSC) to the output images to obtain source-domain and target-domain fake images;
a crowd density estimation module, configured to input the source-domain and target-domain fake images into the scene perception classifier for classification to obtain the corresponding scene perception weights, and then fuse the resulting density feature maps with the scene perception weights through the multi-branch density estimator to predict the crowd density;
and an adaptive adjustment module, configured to adaptively adjust the data enhancement network based on style-transfer learning and the multi-branch density estimator according to the conversion-rate index.
The above examples are provided only for the purpose of describing the present invention, and are not intended to limit the scope of the present invention. The scope of the invention is defined by the appended claims. Various equivalent substitutions and modifications can be made without departing from the spirit and principles of the invention, and are intended to be within the scope of the invention.

Claims (6)

1. A cross-domain adaptive people counting method based on transfer learning and scene perception is characterized by comprising the following steps:
step S1: using sample pairs composed of a source-domain image and a target-domain image to train a data enhancement network based on style-transfer learning, and applying four constraint metrics (style similarity loss L_SSC, content similarity loss L_CTC, cycle reconstruction consistency loss L_CYC, and statistical similarity loss L_CSC) to the output images to obtain source-domain and target-domain fake images;
step S2: inputting the source-domain and target-domain fake images into a scene perception classifier for classification to obtain the corresponding scene perception weights; then fusing the resulting density feature maps with the scene perception weights through a multi-branch density estimator to predict the crowd density;
step S3: adaptively adjusting the data enhancement network based on style-transfer learning and the multi-branch density estimator according to a conversion-rate index.
2. The cross-domain adaptive people counting method based on transfer learning and scene perception according to claim 1, wherein step S1, using sample pairs composed of a source-domain image and a target-domain image to train the data enhancement network based on style-transfer learning and applying the four constraint metrics (style similarity loss L_SSC, content similarity loss L_CTC, cycle reconstruction consistency loss L_CYC, and statistical similarity loss L_CSC) to the output images to obtain source-domain and target-domain fake images, specifically comprises:
step S11: using the source-domain images S = {s_1, s_2, …, s_n} with labeled density maps and the unlabeled target-domain images T = {t_1, t_2, …, t_n} to form training sample pairs <s_i, t_j>, where s_i is the i-th source-domain image and t_j is the j-th target-domain image;
step S12: training the data enhancement network based on style-transfer learning on the training sample pairs <s_i, t_j>, outputting the generated images g_{Si→Tj} and g_{Tj→Si}, and applying the four constraint metrics (L_SSC, L_CTC, L_CYC, and L_CSC) to obtain the trained data enhancement network based on style-transfer learning.
3. The cross-domain adaptive people counting method based on transfer learning and scene perception according to claim 2, wherein step S12, training the data enhancement network based on style-transfer learning on the training sample pairs <s_i, t_j>, outputting the generated images g_{Si→Tj} and g_{Tj→Si}, and applying the four constraint metrics (L_SSC, L_CTC, L_CYC, and L_CSC) to obtain the trained data enhancement network based on style-transfer learning, specifically comprises:
step S121: inputting the training sample pair <s_i, t_j> into the data enhancement network based on style-transfer learning and outputting the generated images g_{Si→Tj} and g_{Tj→Si}; computing the bidirectional style similarity loss L_SSC between each generated image and the corresponding style image via formulas (1) to (3), constraining the image g_{Si→Tj} to have the visual style of the target-domain image t_j and, at the same time, the image g_{Tj→Si} to have the visual style of the source-domain image s_i;
L_SSC^{S→T} = (1/(w·h)) · ‖Gram(g_{Si→Tj}) − Gram(t_j)‖²   (1)
L_SSC^{T→S} = (1/(w·h)) · ‖Gram(g_{Tj→Si}) − Gram(s_i)‖²   (2)
L_SSC = L_SSC^{S→T} + L_SSC^{T→S}   (3)
where Gram(·) denotes the Gram matrix, used to extract image scene-style information; w and h denote the width and height of the image; L_SSC^{S→T} denotes the loss from the source domain to the target domain and L_SSC^{T→S} the loss from the target domain to the source domain; the two are computed in the same way but in opposite directions, and together they form the bidirectional loss L_SSC;
Step S122: computing the bidirectional content similarity loss L_CTC between each generated image and its content source, constraining the generated image g_{Si→Tj} to keep the visual content of the source-domain image s_i and the generated image g_{Tj→Si} to keep the visual content of the target image t_j;
step S123: computing the cycle reconstruction consistency loss L_CYC using formula (4), constraining the image g_{Si→Tj} generated from the source-domain image s_i to be cyclically mapped back to s_i and the image g_{Tj→Si} generated from the target image t_j to be cyclically mapped back to t_j;
L_CYC = ‖G_T(G_S(s_i)) − s_i‖_1 + ‖G_S(G_T(t_j)) − t_j‖_1   (4)
where G_S and G_T respectively denote the generators constructed for the cycle-consistency transfer network, taken here as the source-to-target and target-to-source mappings; L_CYC consists of two reciprocal parts, the first term constraining the cycle reconstruction consistency of the source-domain image and the second that of the target-domain image;
step S124: computing the statistical similarity loss L_CSC using formulas (5) to (7), so that the generated target-domain image can use the annotation information of the source-domain image with the same content;
L_CSC^{S→T} = ‖P(g_{Si→Tj}) − P(s_i)‖_1   (5)
L_CSC^{T→S} = ‖P(g_{Tj→Si}) − P(t_j)‖_1   (6)
L_CSC = L_CSC^{S→T} + L_CSC^{T→S}   (7)
where P(·) denotes the density feature map produced by passing a crowd scene through the crowd density predictor;
step S125: balancing the influence of the four constraints on style and content extraction during training using formula (8), obtaining the source-domain and target-domain fake images;
L* = α_1·L_CYC + α_2·L_CSC + α_3·L_CTC + α_4·L_SSC   (8)
where {α_1, α_2, α_3, α_4} are hyperparameters.
4. The cross-domain adaptive people counting method based on transfer learning and scene perception according to claim 1, wherein step S2, inputting the source-domain and target-domain fake images into the scene perception classifier for classification to obtain the corresponding scene perception weights and fusing the resulting density feature maps with the scene perception weights through the multi-branch density estimator to predict the crowd density, specifically comprises:
step S21: inputting the source-domain and target-domain fake images into the scene perception classifier for classification, obtaining a scene class and the corresponding scene perception weights P = {p_1, p_2, p_3}; the scene classes are: dense scenes, sparse scenes, and medium-density scenes;
step S22: inputting the classified source-domain and target-domain fake images into the multi-branch density estimator and selecting the corresponding branch for estimation according to each image's scene class, obtaining the corresponding density feature map; the multi-branch density estimator comprises:
the first branch, a density estimator for dense scenes, which produces the dense-scene density feature map;
the second branch, a density estimator for sparse scenes, which produces the sparse-scene density feature map;
and the third branch, a density estimator for medium-density scenes, which produces the medium-density-scene density feature map;
step S23: fusing the density feature maps with the corresponding scene perception weights and predicting the crowd density using formula (9);
I_Final = Σ_c p_c ⊙ E_c   (9)
where ⊙ denotes the element-wise product, p_c the probability that a sample belongs to scene class c, E_c the feature map predicted by the c-th branch, and I_Final the final fused estimated crowd density map.
5. The cross-domain adaptive people counting method based on transfer learning and scene perception according to claim 1, wherein the conversion-rate index C_rate (whose formula is given as an image in the original) is computed from nd, the MAE of the model without the data enhancement of step S1, st, the MAE when the model is used directly in the cross-domain real scene, and Q, the current MAE of the model under analysis;
when C_rate does not reach a preset threshold, target-domain images are added progressively in proportion to form new sample pairs, the data enhancement network based on style-transfer learning is further trained, and the learning rates of the different branches in the multi-branch density estimator are adjusted once the data enhancement network plateaus.
6. A cross-domain adaptive people counting system based on transfer learning and scene perception is characterized by comprising the following modules:
a data enhancement module, configured to form sample pairs from source-domain and target-domain images, train the data enhancement network based on style-transfer learning, and apply the four constraint metrics (style similarity L_SSC, content similarity L_CTC, cycle reconstruction consistency L_CYC, and statistical similarity L_CSC) to the output images to obtain source-domain and target-domain fake images;
a crowd density estimation module, configured to input the source-domain and target-domain fake images into the scene perception classifier for classification to obtain the corresponding scene perception weights, and then fuse the resulting density feature maps with the scene perception weights through the multi-branch density estimator to predict the crowd density;
and an adaptive adjustment module, configured to adaptively adjust the data enhancement network based on style-transfer learning and the multi-branch density estimator according to the conversion-rate index.
CN202110418583.XA 2021-04-19 2021-04-19 Cross-domain self-adaptive people counting method based on transfer learning and scene perception Pending CN113095246A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110418583.XA CN113095246A (en) 2021-04-19 2021-04-19 Cross-domain self-adaptive people counting method based on transfer learning and scene perception


Publications (1)

Publication Number Publication Date
CN113095246A (en) 2021-07-09

Family

ID=76678512

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110418583.XA Pending CN113095246A (en) 2021-04-19 2021-04-19 Cross-domain self-adaptive people counting method based on transfer learning and scene perception

Country Status (1)

Country Link
CN (1) CN113095246A (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111783610A (en) * 2020-06-23 2020-10-16 西北工业大学 Cross-domain crowd counting method based on de-entangled image migration
CN112131967A (en) * 2020-09-01 2020-12-25 河海大学 Remote sensing scene classification method based on multi-classifier anti-transfer learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
NA JIANG et al.: "DAPC: Domain Adaptation People Counting via Style-level Transfer Learning and Scene-aware Estimation", 2020 25th International Conference on Pattern Recognition (ICPR) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113642403A (en) * 2021-07-13 2021-11-12 重庆科技学院 Crowd abnormal intelligent safety detection system based on edge calculation
CN113837191A (en) * 2021-08-30 2021-12-24 浙江大学 Cross-satellite remote sensing image semantic segmentation method based on bidirectional unsupervised domain adaptive fusion
CN113837191B (en) * 2021-08-30 2023-11-07 浙江大学 Cross-star remote sensing image semantic segmentation method based on bidirectional unsupervised domain adaptive fusion
CN114707402A (en) * 2022-03-09 2022-07-05 中国石油大学(华东) Method for converting curling simulation image into real image by reinforcement learning perception

Similar Documents

Publication Publication Date Title
Patrikar et al. Anomaly detection using edge computing in video surveillance system
Fan et al. A survey of crowd counting and density estimation based on convolutional neural network
Javan Roshtkhari et al. Online dominant and anomalous behavior detection in videos
CN108921051B (en) Pedestrian attribute identification network and technology based on cyclic neural network attention model
CN113095246A (en) Cross-domain self-adaptive people counting method based on transfer learning and scene perception
CN109508360B (en) Geographical multivariate stream data space-time autocorrelation analysis method based on cellular automaton
CN111723693B (en) Crowd counting method based on small sample learning
CN113536972B (en) Self-supervision cross-domain crowd counting method based on target domain pseudo label
WO2023207742A1 (en) Method and system for detecting anomalous traffic behavior
Asad et al. Anomaly3D: Video anomaly detection based on 3D-normality clusters
Hu et al. Parallel spatial-temporal convolutional neural networks for anomaly detection and location in crowded scenes
CN112115849A (en) Video scene identification method based on multi-granularity video information and attention mechanism
Wang et al. Crowdmlp: Weakly-supervised crowd counting via multi-granularity mlp
Duan et al. Sofa-net: Second-order and first-order attention network for crowd counting
CN112819063A (en) Image identification method based on improved Focal loss function
Pang et al. Federated learning for crowd counting in smart surveillance systems
Qureshi et al. Neurocomputing for internet of things: object recognition and detection strategy
Hasan et al. Estimating traffic density on roads using convolutional neural network with batch normalization
Zhang et al. A spatiotemporal graph wavelet neural network for traffic flow prediction
CN116503776A (en) Time-adaptive-based space-time attention video behavior recognition method
Ren et al. Student behavior detection based on YOLOv4-Bi
Jebur et al. Abnormal Behavior Detection in Video Surveillance Using Inception-v3 Transfer Learning Approaches
CN112926517B (en) Artificial intelligence monitoring method
Annamalai et al. EvAn: Neuromorphic event-based sparse anomaly detection
Vu et al. Anomaly detection in surveillance videos by future appearance-motion prediction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210709