CN114120363A - Pedestrian cross-camera re-identification method and system based on background and pose normalization - Google Patents

Pedestrian cross-camera re-identification method and system based on background and pose normalization

Info

Publication number
CN114120363A
CN114120363A (application CN202111394021.2A)
Authority
CN
China
Prior art keywords
feature
pedestrian
background
image
target
Prior art date
Legal status
Pending
Application number
CN202111394021.2A
Other languages
Chinese (zh)
Inventor
陆科名
刘文斌
陈伟
陈曦珑
赵雪珺
Current Assignee
SHANGHAI CRIMINAL SCIENCE TECHNOLOGY RESEARCH INSTITUTE
Original Assignee
SHANGHAI CRIMINAL SCIENCE TECHNOLOGY RESEARCH INSTITUTE
Priority date
Filing date
Publication date
Application filed by SHANGHAI CRIMINAL SCIENCE TECHNOLOGY RESEARCH INSTITUTE filed Critical SHANGHAI CRIMINAL SCIENCE TECHNOLOGY RESEARCH INSTITUTE
Priority to CN202111394021.2A priority Critical patent/CN114120363A/en
Publication of CN114120363A publication Critical patent/CN114120363A/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

A pedestrian cross-camera re-identification method and system based on background and pose normalization: local features of key points and a global feature are extracted from a background-normalized version of the original image by a two-branch human pose estimation network; after the coordinate and category information of each key point is fused into its features, the features are input into a Transformer coding model trained with a contrastive learning strategy, and the similarity between the features output for different targets serves as the basis for cross-environment pedestrian re-identification. The invention mitigates, to varying degrees, the problems commonly seen in cross-camera pedestrian re-identification tasks, such as shadow occlusion, illumination change, background switching, pose change, and partial occlusion, and has high practical value.

Description

Pedestrian cross-camera re-identification method and system based on background and pose normalization
Technical Field
The invention relates to a technology in the field of image recognition, in particular to a pedestrian cross-camera re-identification method and system based on background and pose normalization.
Background
Pedestrian re-identification is a technology that uses computer vision algorithms to detect whether a specific pedestrian appears in an image or video. It can complement face recognition when searching for and tracking pedestrians whose faces cannot be captured clearly. Applied to video surveillance, the technology works as follows: a target pedestrian is selected from the video frames collected by surveillance cameras in public places; with the camera where the target appears as the center, the target is searched for in all cameras within a certain surrounding range, and the pedestrian's movement trajectory is determined.
Several difficulties in pedestrian re-identification remain unsolved. In practice it is hard to unify the heights, angles, and resolutions of different cameras and the scenes they monitor; illumination and weather conditions differ between times of day; pedestrians' poses and orientations change constantly; and occlusion of pedestrians is ubiquitous in surveillance video. These factors severely restrict the real-world effectiveness of pedestrian re-identification; cross-camera re-identification in particular does not yet meet the practical requirements of the security field.
Disclosure of Invention
Aiming at the low recall caused by background interference and pose dissimilarity in existing pedestrian cross-camera re-identification technology, the invention provides a method and a system for pedestrian cross-camera re-identification based on background and pose normalization. The method normalizes the illumination of pedestrians across different surveillance scenes through homomorphic filtering and unifies pedestrian background information through instance segmentation, which works well against uneven illumination and dissimilar backgrounds. Through a two-branch pose estimation network, local features of 20 pedestrian key points, covering regions such as the hairstyle and carried objects, are extracted together with a global human-body feature and fed into a Transformer coding model trained with a contrastive learning strategy to obtain the corresponding feature codes. The Transformer coding model captures the association relations among input features and re-models the local features of each key point, which effectively reduces the interference of pose differences with pedestrian re-identification; because the model adopts a mask strategy during training, it also suppresses the interference of occlusion with the feature codes. Finally, the similarity of different pedestrian pictures is computed through the cosine distance.
The invention provides a two-branch pose estimation network and a Transformer coding model trained with a contrastive learning strategy, integrates illumination equalization and background normalization, and offers a pedestrian re-identification scheme usable in real, complex scenes. It mitigates, to varying degrees, the problems common in cross-camera pedestrian re-identification tasks, such as shadow occlusion, illumination change, background switching, pose dissimilarity, and partial occlusion, and has high practical value.
The invention is realized by the following technical scheme:
The invention relates to a pedestrian cross-camera re-identification method based on background and pose normalization: local features of key points and a global feature are extracted from a background-normalized version of the original image by a two-branch human pose estimation network; after the coordinate and category information of each key point is fused into the key-point features, they are input into a Transformer coding model trained with a contrastive learning strategy; the coding features output for different targets are obtained, and their similarity is computed as the basis for cross-environment pedestrian re-identification.
The background-normalized image of the original image is obtained as follows: decode the original image and preprocess the pedestrian picture with a homomorphic filtering algorithm to obtain a uniform-illumination picture set; then perform instance segmentation on the uniform-illumination picture, extract the pedestrian image, and embed it into a rectangular frame filled with RGB(255, 255, 255) to obtain the background-normalized picture set.
The uniform-illumination picture is obtained as follows: decode the original image, denoted F_in; take the logarithm of the image pixel matrix, Z_in = ln(F_in + 1); transform the image into the frequency domain with a fast Fourier transform, X_ω = FFT(Z_in); filter the frequency-domain image with a Gaussian high-pass filter H, recording the result as S = H(X_ω); restore the frequency-domain image to the spatial domain with an inverse fast Fourier transform, Z_out = IFFT(S); and exponentiate the recovered spatial-domain data to obtain the illumination-equalized image, F_out = exp(Z_out).
The background-normalized picture set is obtained as follows: extract features of the whole picture with a pre-trained target detection model and predict the candidate-box coordinates of potential target regions; crop the feature sub-map of each potential target region out of the whole-picture features according to the candidate-box coordinates; take each element of the feature sub-map as a node of a graph and input it into a front graph convolution layer to extract the association relations between elements, obtaining the mask information of occluders inside the potential target box and decoupling the boundary between the occluder and the occluded target; add the potential-target-region feature sub-map to the features processed by the front graph neural network, input the sum into a rear graph convolution network, and output the segmentation results for the occluder and the occluded target in the region; finally, extract the segmented pedestrian target from the original picture and embed it into a rectangular frame filled with RGB(255, 255, 255) to obtain the background-normalized picture set.
The two-branch human pose estimation network comprises: a backbone network that extracts four feature maps of different resolutions, a global feature aggregation module that aggregates the four feature maps into a global feature, a feature super-resolution module that enlarges the four feature maps to 4 times their original resolution, and a multi-resolution heat map aggregation module for heat map prediction and heat map fusion.
Compared with HRNet, the backbone network omits the final feature aggregation module.
The global feature aggregation module comprises a 1 × 1 convolutional layer and a 3 × 3 convolutional layer.
The feature super-resolution module comprises a bilinear interpolation layer and a convolutional layer with stride 1.
The multi-resolution heat map aggregation module comprises a channel merging unit containing 20 1 × 1 convolutional layers for heat map prediction, plus a bilinear interpolation layer and an averaging layer for heat map fusion.
During training, the two-branch human pose estimation network limits the range of each training key-point heat map according to the size of the actual key region.
The coordinate and category information of each key point is fused into the key-point features as follows: flatten the key-point local features and the global feature output by the two-branch human pose estimation network by concatenating rows end to end, then stack the flattened features to obtain a feature embedding matrix; construct a key-point category embedding matrix by sinusoidal encoding and add it to the feature embedding matrix; and append the position information of each key point to the feature matrix by concatenation.
In the training stage of the Transformer coding model, randomly selected input key-point features are replaced with white Gaussian noise to simulate real-world occlusion. The objective function of the training stage is a cross-entropy loss over the feature similarities of different samples; the training objective is to make the similarity between different images of the same pedestrian larger and the similarity between images of different pedestrians smaller.
The invention also relates to a system implementing the above pedestrian cross-camera re-identification method based on background and pose normalization, comprising a video preprocessing module, a background normalization module, a key-point feature extraction module, a feature coding module, and a similarity calculation module, wherein: the video preprocessing module decodes the surveillance video, acquires the surveillance pictures to be processed, and performs homomorphic filtering to obtain a uniform-illumination picture set; the background normalization module processes the background information of the pedestrian target in the surveillance picture, separating the pedestrian target from the uniform-illumination picture and embedding it into a rectangular frame filled with RGB(255, 255, 255) to obtain a background-normalized picture set; the key-point feature extraction module processes the background-normalized picture set, using the two-branch pose estimation network to obtain key-point features and a global feature of the pedestrian target across different scales, forming a first feature set; the feature coding module further processes the first feature set, embedding key-point positions and category information and modeling the association relations of different key points to suppress pose dissimilarity and occlusion interference, producing a second feature set; and the similarity calculation module computes the similarity between the second feature sets of the recognition target and the objects to be searched to obtain the pedestrian re-identification result.
Technical effects
Compared with the prior art, the two-branch pose estimation network of the invention uses HRNet as the backbone and extracts the global feature and the key-point region features of a pedestrian picture through a dual-stream feature extraction structure with feature super-resolution; in the training stage it defines key points better suited to pedestrian re-identification and standardizes the training heat map range of each key-point region. Unlike a conventional visual Transformer coding model, the data patches used by this model are not obtained by directly slicing the image; they are selectively extracted by the key-point feature extraction module and embed the pedestrian target's key-point region features, global feature, key-point categories, and coordinate information. A random mask layer added during training makes the model more robust to partial occlusion, and using the similarity of different training samples as the label makes the output feature codes better suited to computing pedestrian similarity. In addition, the proposed cross-camera pedestrian re-identification method integrates preprocessing such as illumination equalization and background normalization, so the subsequent key-feature extraction and feature coding face less interfering information and are more accurate.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of an illuminance equalization method;
FIG. 3 is a schematic diagram of background normalization using instance segmentation;
FIG. 4 is a schematic diagram of extracting key point features of human body posture and constructing a first feature set;
FIG. 5 is a schematic diagram of extracting feature association relations of different key points and constructing a second feature set;
FIG. 6 is a schematic diagram illustrating the effects of the embodiment.
Detailed Description
As shown in fig. 1, the present embodiment relates to a pedestrian cross-camera re-identification method with background and pose normalization, comprising the following steps:
S1, decoding the original surveillance video, extracting surveillance pictures of different scenes and different moments, and performing homomorphic filtering to adjust the illumination conditions in the surveillance pictures to be consistent, reducing the influence of shadow and illumination factors on pedestrian re-identification and obtaining an illumination-equalized picture set;
as shown in fig. 2, the homomorphic filtering specifically includes:
s11, decoding the image data FinTaking the logarithm, thereby converting the multiplied incident component I and the reflected component R in the image into an additive form, i.e. from FinConversion of I.R to lnFin(ii) lnI + lnR, and the result after conversion is recorded as ZinIn this embodiment, to avoid FinIn the case where the number is 0, the logarithm may not be obtained, and in the case of FinAll elements are first added with 1 before taking the logarithm.
S12, transforming Z by using fast Fourier transforminConversion to the frequency domain, denoted Xω.
S13, applying Gaussian high-pass filter to frequency domain data XωAnd carrying out high-pass filtering, wherein the filtering result is marked as S, and the reserved high-frequency data is frequency domain data of the reflection component R.
S14, restoring the frequency domain image to the space domain by using inverse fast Fourier transform, and recording as Zout.
S15, finally, taking the natural constant as the base number to carry out the operation of fetching the index, and recovering to the normal image FoutTo form an illuminance balanceAnd (5) collecting pictures.
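The homomorphic filtering steps S11-S15 can be sketched in NumPy as follows; the Gaussian high-pass parameters (`d0`, `gamma_l`, `gamma_h`) are illustrative assumptions, since the patent does not specify the filter's constants:

```python
import numpy as np

def homomorphic_filter(img, d0=30.0, gamma_l=0.5, gamma_h=2.0):
    """Illumination equalization via homomorphic filtering (steps S11-S15).

    img: 2-D float array (one channel). d0, gamma_l, gamma_h parameterize
    an assumed Gaussian high-pass filter; the patent only names the filter type.
    """
    z = np.log(img.astype(np.float64) + 1.0)       # S11: Z_in = ln(F_in + 1)
    x = np.fft.fft2(z)                             # S12: to frequency domain
    h, w = img.shape
    u = np.fft.fftfreq(h).reshape(-1, 1) * h
    v = np.fft.fftfreq(w).reshape(1, -1) * w
    d2 = u ** 2 + v ** 2
    # S13: Gaussian high-pass filter -- low frequencies (illumination)
    # attenuated toward gamma_l, high frequencies (reflectance) kept near gamma_h
    H = (gamma_h - gamma_l) * (1.0 - np.exp(-d2 / (2.0 * d0 ** 2))) + gamma_l
    s = H * x
    z_out = np.real(np.fft.ifft2(s))               # S14: back to spatial domain
    return np.exp(z_out)                           # S15: F_out = exp(Z_out)

rng = np.random.default_rng(0)
out = homomorphic_filter(rng.random((64, 64)) * 255.0)
```

The output stays strictly positive because S15 exponentiates the real-valued spatial-domain data.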
S2, performing instance segmentation on the illumination-equalized picture set, extracting the pedestrian image, and embedding it into a white rectangular frame to obtain a background-normalized picture set;
As shown in fig. 3, the background normalization in this embodiment includes target detection, mask generation, and background synthesis, and the models involved are trained on locally hand-labeled data.
S21, select rectangular regions containing pedestrians from the illumination-equalized picture with the trained target detection model.
In this embodiment, a YOLOv5 model is used for pedestrian target selection; in the training stage the model is fine-tuned for 100 epochs from a preset pre-trained model on a hand-labeled, illumination-equalized surveillance-view data set, yielding the trained target detection model. The output of this step is the bounding-box information of potential target rectangles.
S22, according to the potential-target bounding-box information, a trained instance segmentation model containing two graph convolution layers crops the feature sub-map of each potential target region out of the overall picture features; the loss function used when training the instance segmentation model is the sum of the classification loss, the regression loss, and the center-distance loss output by the two graph-convolution instance segmentation branches.
S23, extract the occluder mask through the front graph neural network. The potential-target-region feature sub-map is input into a conventional convolutional layer; each element of the output feature map is taken as a node of a graph and input into a graph convolution layer to extract the association relations between elements. Specifically, the input features are multiplied by a weight matrix serving as the adjacency matrix of the graph, the dimensions are changed by a 1 × 1 convolutional layer, a Softmax operator performs the nonlinear transformation, and a ReLU operator finally activates the result, yielding the association relations among elements of different regions. These association relations are input into a fully connected layer to obtain the mask information of the occluder inside the potential target box. The weight parameters of the convolutional layer, the graph convolution layer, and the fully connected layer in this step are obtained by training on the training set. The advantage of using a graph convolution network here is that it better overcomes the disconnection of pixel regions of the same target caused by occlusion.
S24, extract the mask of the occluded object through the rear graph neural network. This step is similar to step S23, except that its input is the sum of the potential-target-region feature sub-map and the features processed by the front graph neural network, and its output is the mask information of the occluded target.
S25, according to the mask information of the occluder and the occluded object, extract the pedestrian target from the original picture and embed it into a rectangular frame filled with pure white RGB(255, 255, 255) to obtain the background-normalized picture set.
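Step S25, pasting the segmented pedestrian onto a pure-white canvas, amounts to a simple masked copy. A minimal NumPy sketch, assuming the instance-segmentation stage has already produced a boolean pedestrian mask:

```python
import numpy as np

def normalize_background(img, mask):
    """Embed the segmented pedestrian in a white rectangular frame (step S25).

    img: H x W x 3 uint8 picture; mask: H x W boolean pedestrian mask
    (assumed to come from the instance-segmentation stage). Returns an image
    of the same size whose non-pedestrian pixels are RGB(255, 255, 255).
    """
    canvas = np.full_like(img, 255)   # pure-white rectangular frame
    canvas[mask] = img[mask]          # keep only the pedestrian pixels
    return canvas

img = np.zeros((4, 4, 3), dtype=np.uint8)  # toy all-black picture
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True                      # pretend these pixels are the pedestrian
out = normalize_background(img, mask)
```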
S3, inputting the background-normalized picture set into the trained two-branch human pose estimation network and extracting 20 key-point local features and a global feature of the pedestrian target to form a first feature set;
The 20 key points are: top of head, eyes, mouth, nape, left shoulder, right shoulder, left elbow, right elbow, left hand, right hand, left-hand carried object, right-hand carried object, upper-torso center, crotch, left knee, right knee, left ankle, right ankle, left foot, and right foot. Using the top of the head, the nape, the carried objects, and the upper-torso center as predicted key points captures information about the hairstyle, handbags, and upper-body clothing, which better suits the pedestrian re-identification scenario.
As shown in fig. 4, the two-branch human pose estimation network comprises: a backbone network, a global feature aggregation module, a feature super-resolution module, and a multi-resolution heat map aggregation module, wherein two rear networks are connected in parallel after the backbone module; the global feature aggregation module serves as the first rear network, and the sequentially connected feature super-resolution, multi-resolution heat map aggregation, and key region selection modules serve as the second rear network. The weight parameters of the two-branch human pose estimation network are obtained by training on locally hand-labeled data.
The extraction comprises: multi-granularity heat map extraction, key region selection, and feature fusion; all models and networks involved are trained on locally hand-labeled data.
S31, construct an HRNet without the terminal aggregation layer as the backbone network, and input the background-normalized picture set to obtain four feature maps of different scales as a temporary feature set.
S32, the global feature aggregation module serving as the first rear network converts the 4 feature maps in the temporary feature set into feature maps of size 28 × 28 through convolution operations, adjusts the channel count of each feature map to 1 with 1 × 1 convolutional layers, and inputs the 4 single-channel 28 × 28 feature maps into a convolutional layer with 3 × 3 kernels, obtaining a 28 × 28 × 3 feature map as the global feature of the background-normalized picture.
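The shape flow of step S32 can be illustrated with a toy NumPy sketch; the backbone channel count (8), the random weights, and the nearest-neighbour resizing used here in place of the patent's convolutional resizing are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def resize_nearest(fmap, size):
    """fmap: (C, H, W) -> (C, size, size) by nearest-neighbour sampling."""
    c, h, w = fmap.shape
    ri = np.arange(size) * h // size
    ci = np.arange(size) * w // size
    return fmap[:, ri][:, :, ci]

def conv1x1(fmap, weight):
    """1x1 convolution: weight (C_out, C_in) applied to fmap (C_in, H, W)."""
    return np.tensordot(weight, fmap, axes=([1], [0]))

def conv3x3(fmap, weight):
    """3x3 convolution with padding 1: weight (C_out, C_in, 3, 3)."""
    c_in, h, w = fmap.shape
    padded = np.pad(fmap, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((weight.shape[0], h, w))
    for i in range(3):
        for j in range(3):
            out += np.tensordot(weight[:, :, i, j],
                                padded[:, i:i + h, j:j + w], axes=([1], [0]))
    return out

# Four backbone feature maps at different resolutions (toy channel count 8).
maps = [rng.random((8, s, s)) for s in (56, 28, 14, 7)]
# Bring each to 28x28 and squeeze it to a single channel with a 1x1 conv.
single = [conv1x1(resize_nearest(m, 28), rng.random((1, 8))) for m in maps]
stacked = np.concatenate(single, axis=0)            # (4, 28, 28)
# A 3x3 conv merges the four maps into the 28x28x3 global feature
# (channel-first here; the patent writes it channel-last as 28x28x3).
global_feat = conv3x3(stacked, rng.random((3, 4, 3, 3)))
```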
S33, the feature super-resolution module in the second rear network performs bilinear interpolation on the feature map output by each branch of the temporary feature set, expanding the resolution to 2 times the branch output in both width and height, i.e. expanding the element count of the feature map to 4 times; feature re-fusion through a convolutional layer with stride 1 then yields a second temporary feature set.
S34, the multi-resolution heat map aggregation module in the second rear network merges the channels of the four feature maps in the second temporary feature set through a convolutional layer containing 20 1 × 1 convolutions and predicts, for each element of the four feature maps, the probability of lying near each key point, obtaining the key-point heat maps output by the different branches; bilinear interpolation then upsamples the branch heat maps to the resolution of the input image, and averaging yields the average key-point heat map;
S35, the key region selection module in the second rear network rectifies each key point's average heat map by value to reduce the influence of misrecognized points on the subsequent weighted feature extraction, specifically:
i) with 1/20 of the maximum value in the heat map as the reference baseline, values below the baseline are set to 0 and values at or above it are unchanged;
ii) each connected hot block in each heat map is enclosed by a square box whose side length is the larger of the block's horizontal and vertical extents; the sum of all values inside each box is computed, and the box with the maximum sum is retained as the heat map's key-point region, with its center point taken as the key-point coordinate. In total, 20 key points and 20 key-region coordinates are obtained.
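Steps i) and ii) can be sketched as follows, under the simplifying assumption that thresholding leaves a single connected hot block (the patent scores every connected block and keeps the highest-sum box):

```python
import numpy as np

def select_key_region(heat):
    """Rectify a key-point heat map and pick its key region (step S35).

    Values below 1/20 of the map's maximum are zeroed (step i). The
    remaining hot region -- assumed here to be one connected block -- is
    enclosed in a square whose side is the larger of the block's horizontal
    and vertical extents (step ii). Returns the (row, col) centre, taken as
    the key-point coordinate, and the square's side length.
    """
    heat = np.where(heat >= heat.max() / 20.0, heat, 0.0)  # i) threshold
    rows, cols = np.nonzero(heat)
    side = max(rows.max() - rows.min(), cols.max() - cols.min()) + 1
    centre = ((rows.min() + rows.max()) // 2, (cols.min() + cols.max()) // 2)
    return centre, side

heat = np.zeros((8, 8))
heat[2:5, 3:5] = 1.0           # a 3x2 hot block
centre, side = select_key_region(heat)
```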
S36, multiply the heat values inside the normalized key-point region pixel by pixel with the corresponding region of the background-normalized original image to form the 20 key-point local features, which together with the global feature form the first feature set.
In the local hand-labeling, besides the key-point positions, the extent of each key region is labeled with a rectangular box. The label used for model training is the probability-distribution heat map near each key point. A conventional key-point heat map is a two-dimensional Gaussian centered on the key-point coordinate with standard deviation 1; here, the key-region extent is used as a limit, and elements of the Gaussian that fall outside the key region are set to 0, i.e. only the heat map inside the rectangular box is kept. This prevents the Gaussian from spilling beyond the key region and reduces the interference of background factors with key-point feature extraction. The loss function of this model's training is the mean squared error between the network-predicted heat map and the hand-labeled heat map.
S4, fuse the key-point local features and the global feature in the first feature set, embed the coordinate and category information, and input the result into the Transformer coding model trained with the contrastive learning strategy to obtain the feature codes used for computing pedestrian similarity, forming the second feature set;
As shown in fig. 5, the Transformer coding model re-models the association relations of different key points according to the embedded position and category information, which alleviates to some extent the interference of pose dissimilarity with pedestrian re-identification. The invention further proposes a mask-based training strategy that simulates the partial occlusion of pedestrian targets in real scenes, so that the trained model also alleviates, to some extent, the interference of occlusion with pedestrian re-identification.
S41, perform bilinear interpolation on the 20 square key-point local features in the first feature set, converting each into a 28 × 28 × 3 local feature map; splice the rows and columns of each local feature map end to end to obtain a 20 × 2352 matrix as the local feature embedding matrix; then splice the rows and columns of the 28 × 28 × 3 global feature end to end to obtain a global feature embedding vector of length 2352, and add it to the local feature embedding matrix to form a feature embedding matrix of dimension 21 × 2352.
S42, construct the category embedding matrix and the coordinate embedding matrix; add the feature embedding matrix and the category embedding matrix element by element to obtain a 21 × 2352 feature matrix containing category information, and splice this feature matrix with the coordinate embedding matrix to obtain the final embedding matrix of dimension 21 × 2354; input it into the Transformer coding model designed by the invention and trained with the contrastive learning strategy to obtain the output feature vector as the second feature set;
the row number of the category embedding matrix is a global feature category and 20 key point categories, the column number is the length of each feature embedding vector, the dimensionality is 21 multiplied by 2352, and the element numerical value of the ith row and the jth column in the matrix
Figure BDA0003369292210000071
The dimension of the coordinate embedding matrix is 21 × 2, holding the relative coordinates (x_r, y_r) of the 20 key points and the global feature on the background-normalized picture. The relative coordinates of the 20 key points are obtained by dividing the absolute coordinates (x, y) of each key point in the picture by the overall picture size (w, h), i.e. (x_r, y_r) = (x / w, y / h). The relative coordinates of the global feature are set to (0, 0).
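The embedding construction in S42 can be sketched as follows. Since the exact category-embedding formula is given only as an image in the original, each category row here is filled with its category index purely as a stand-in; picture size and key point coordinates are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
feature_embed = rng.standard_normal((21, 2352))   # from step S41

# Category embedding: 21 rows (global + 20 key-point categories) x 2352.
# Stand-in values: each row filled with its category index.
category_embed = np.repeat(np.arange(21, dtype=float)[:, None], 2352, axis=1)

# Coordinate embedding: relative coordinates (x/w, y/h) of the 20 key
# points, with (0, 0) for the global feature.
w, h = 128, 256
abs_xy = rng.integers(0, [w, h], size=(20, 2)).astype(float)
coord_embed = np.vstack([[0.0, 0.0], abs_xy / [w, h]])   # 21 x 2

# Element-wise add category info, then splice coordinates: 21 x 2354.
final_embed = np.concatenate([feature_embed + category_embed, coord_embed], axis=1)
print(final_embed.shape)  # (21, 2354)
```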
Furthermore, besides the local and global feature embedding layer and the coordinate and category information embedding layer, the Transformer coding model trained with the contrastive learning strategy also comprises a data random mask module and a multi-head self-attention encoder, wherein the data random mask module randomly masks data only during the model training stage, and the multi-head self-attention encoder converts the final embedding matrix into the output feature vector;
the data random mask module is set as white Gaussian noise for each row feature in the final embedded matrix with the probability of 15%, equivalently, the key point feature is shielded, the module is activated only in the model training stage, and the final embedded matrix in the inference stage directly enters the multi-head self-attention encoder without passing through the module.
The multi-head self-attention encoder comprises several multi-head attention layers and linear layers; their number and dimensions are not limited. The output of the last layer of the encoder is the final feature used for computing the similarities between different targets;
The contrastive learning strategy training proceeds as follows: the training pictures are input into the model in batches to obtain the final feature representation z of each picture; the dot products between final features are computed, the loss function L is computed from these dot products, and the model parameters are updated by back propagation, where the loss for sample i is

L_i = -log( exp(z_i · z_j / τ) / Σ_{k∈K, k≠i} exp(z_i · z_k / τ) )
wherein: i is the ith training sample, j is a training sample belonging to the same pedestrian target as i, K is the set of indices of all samples in the batch, and τ is a hyperparameter. The batch size affects the training effect: the larger the batch, the better the training.
To overcome the limit that the memory of the computing device places on the batch size, during training the loss is not computed immediately after the final features of a batch are obtained; instead, the features are temporarily cached, the final features of several subsequent batches are computed, the final features of these batches are aggregated together, the loss is computed over the aggregate, and the gradient update is then performed. With this strategy, the invention achieves the same training effect as a larger batch without changing the hardware.
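The contrastive loss and the cached-batch aggregation described above can be sketched as follows (feature dimension, pedestrian IDs, and the value of τ are illustrative assumptions; `contrastive_loss` is a hypothetical name):

```python
import numpy as np

def contrastive_loss(features, ids, tau=0.07):
    """InfoNCE-style loss over dot products of final features:
    L_i = -log( exp(z_i.z_j / tau) / sum_{k != i} exp(z_i.z_k / tau) ),
    where j shares i's pedestrian ID. `features` may be the concatenation
    of several cached batches to emulate a larger effective batch."""
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = z @ z.T / tau
    n = len(ids)
    losses = []
    for i in range(n):
        positives = [j for j in range(n) if j != i and ids[j] == ids[i]]
        if not positives:
            continue
        denom = sum(np.exp(sim[i, k]) for k in range(n) if k != i)
        for j in positives:
            losses.append(-np.log(np.exp(sim[i, j]) / denom))
    return float(np.mean(losses))

# Two cached mini-batches aggregated before computing the loss.
rng = np.random.default_rng(0)
batch1 = rng.standard_normal((4, 8))
batch2 = rng.standard_normal((4, 8))
ids = [0, 0, 1, 1, 2, 2, 3, 3]
loss = contrastive_loss(np.vstack([batch1, batch2]), ids)
print(loss > 0)  # True
```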
S5, calculating the overall similarity between the search target and each target in the library of objects to be searched to obtain the re-identification result, specifically: using the above steps, the final feature codes of all pedestrian targets appearing in different monitoring pictures are obtained as the feature library of objects to be searched; the final feature code of the search target is obtained with the same steps; the cosine distance between the final feature code of the search target and that of each pedestrian in the feature library is computed as the similarity between the search target and each object to be searched, and the re-identification results are given sorted by similarity in descending order.
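The cosine-similarity ranking of S5 might look like this (toy 2-D features for illustration):

```python
import numpy as np

def rank_by_cosine(query, gallery):
    """Cosine similarity between the query's final feature code and each
    gallery feature code; returns gallery indices sorted by descending
    similarity, plus the similarities themselves."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q
    order = np.argsort(-sims)        # descending similarity
    return order, sims

gallery = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
query = np.array([1.0, 0.1])
order, sims = rank_by_cosine(query, gallery)
print(order[0])  # 0 (the most similar gallery entry)
```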
Table 1 reports the mean average precision of the method on a manually labeled real-scene data set, and Fig. 6 shows sample test results of the method on this data set.
Table 1: real scene data set test result statistics
Method for producing a composite material HOReid TransReid
Map 35.4 26.5 31.2
Rank-1 42.7 30.3 36.3
Compared with the prior art, the method achieves a good recognition effect on low-quality samples in real monitoring footage, and in particular significantly alleviates the interference that common factors such as large differences in illumination conditions, large differences in background environment, inconsistent poses and partial occlusion cause for pedestrian re-identification.
The foregoing embodiments may be modified in many different ways by those skilled in the art without departing from the spirit and scope of the invention, which is defined by the appended claims and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (10)

1. A pedestrian cross-mirror re-identification method based on background and pose normalization, characterized in that local features of key points and a global feature are extracted from the background-normalized image of an original image by a two-way human pose estimation network; after the coordinates and category information of the key points are fused into the key point features, they are input into a Transformer coding model trained with a contrastive learning strategy, and the similarity between the features output for different targets is used as the basis of pedestrian cross-mirror re-identification;
The two-way human pose estimation network comprises: a backbone network for extracting four feature maps of different resolutions, a global feature aggregation module for aggregating the four feature maps of different resolutions into a global feature, a feature super-resolution module for enlarging the resolution of the four feature maps, and a multi-resolution heat map aggregation module for heat map prediction and heat map fusion;
In the training stage of the Transformer coding model, randomly selected input key point features are replaced with Gaussian white noise to simulate real-world occlusion; the objective function used in the training stage is a cross-entropy loss based on the feature similarities between different samples, and the training goal is to make the similarity between different images of the same pedestrian larger and the similarity between images of different pedestrians smaller.
2. The pedestrian cross-mirror re-identification method based on background and pose normalization according to claim 1, wherein the background normalized image of the original image is obtained as follows: the original image is decoded, and the pedestrian pictures are preprocessed with a homomorphic filtering algorithm to obtain a uniform-illumination picture set; instance segmentation is performed on the uniform-illumination pictures, and the pedestrian image is extracted and embedded into a rectangular frame with fill color RGB(255, 255, 255) to obtain a background normalized picture set.
3. The pedestrian cross-mirror re-identification method based on background and pose normalization according to claim 1, wherein the uniform-illumination image set is obtained as follows: the original image is decoded, and the decoded image is denoted F_in; the logarithm of the image pixel matrix is taken, denoted Z_in, with Z_in = ln(F_in + 1); the image is transformed into the frequency domain with a fast Fourier transform, denoted X_ω, with X_ω = FFT(Z_in); the frequency-domain image is filtered with a Gaussian high-pass filter H, the result denoted S, with S = H(X_ω); the frequency-domain image is restored to the spatial domain with an inverse fast Fourier transform, giving Z_out = IFFT(S); finally the recovered spatial-domain data are exponentiated to obtain the illumination-balanced image, denoted F_out, with F_out = exp(Z_out).
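The filtering chain of claim 3 can be sketched in numpy. The filter width σ and the Gaussian high-pass gains are assumed values not specified in the text:

```python
import numpy as np

def homomorphic_filter(img, sigma=30.0, gamma_l=0.5, gamma_h=1.5):
    # Z_in = ln(F_in + 1)
    z = np.log(img.astype(float) + 1.0)
    # X_w = FFT(Z_in)
    X = np.fft.fft2(z)
    # Gaussian high-pass filter H (sigma and the gamma gains are assumptions)
    rows, cols = img.shape
    u = np.fft.fftfreq(rows)[:, None] * rows
    v = np.fft.fftfreq(cols)[None, :] * cols
    H = gamma_l + (gamma_h - gamma_l) * (1.0 - np.exp(-(u**2 + v**2) / (2.0 * sigma**2)))
    # S = H(X_w); Z_out = IFFT(S); F_out = exp(Z_out)
    z_out = np.real(np.fft.ifft2(H * X))
    return np.exp(z_out)

img = (np.arange(64, dtype=float).reshape(8, 8) % 16) * 16  # toy grayscale tile
out = homomorphic_filter(img)
print(out.shape)  # (8, 8)
```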
4. The pedestrian cross-mirror re-identification method based on background and pose normalization according to claim 1, wherein the background normalized picture set is obtained as follows: features of the whole picture are extracted with a pre-trained target detection model and candidate-frame coordinates of potential target areas are predicted; a feature sub-graph of each potential target area is cropped from the whole-image features according to the candidate frame coordinates; each element of the feature sub-graph of the potential target area is taken as a node of a graph and input into a front graph convolution layer, the association relations between different elements are extracted, the mask information of occluders inside the potential target frame is obtained, and the boundary between the occluder and the occluded target is decoupled; the potential target area feature sub-graph is added to the features processed by the front graph neural network and input into a rear graph convolution network, which outputs the segmentation results of the occluder and the occluded target in the target area; the segmented pedestrian target is extracted from the original picture and embedded into a rectangular frame with fill color RGB(255, 255, 255) to obtain the background normalized picture set.
5. The pedestrian cross-mirror re-identification method based on background and pose normalization according to claim 1, wherein the backbone network in the two-way human pose estimation network is an HRNet simplified by removing the final feature aggregation module;
the global feature aggregation module comprises: 1 × 1 convolutional layer and 3 × 3 convolutional layer;
the feature super-resolution module comprises: a bilinear interpolation layer and a convolution layer with a step length of 1;
the multi-resolution heat map aggregation module comprises: a channel merging unit containing 20 convolutional layers of 1 × 1 convolution for implementing heat map prediction and a bilinear interpolation layer and an average layer for heat map fusion.
6. The pedestrian cross-mirror re-identification method based on background and pose normalization according to any one of claims 1 to 5, characterized in that the method specifically comprises the following steps:
S1, decoding the original monitoring video, extracting monitoring pictures of different scenes and different moments, and performing homomorphic filtering to adjust the illumination conditions in the monitoring pictures to be consistent and reduce the influence of shadow and illumination factors on pedestrian re-identification, obtaining an illumination-balanced picture set;
S2, performing instance segmentation on the illumination-balanced picture set, extracting the pedestrian images therein, and embedding them into white rectangular frames to obtain a background normalized picture set;
S3, inputting the background normalized picture set into the trained two-way human pose estimation network, and extracting the local features of 20 key points and the global feature of the pedestrian target to form a first feature set;
S4, fusing the local and global key point features in the first feature set, embedding the coordinate and category information, and inputting the result into the Transformer coding model trained with the contrastive learning strategy to obtain feature codes for computing pedestrian similarity, forming a second feature set;
S5, calculating the overall similarity between the search target and each target in the library of objects to be searched to obtain the re-identification result, specifically: using the above steps, the final feature codes of all pedestrian targets appearing in different monitoring pictures are obtained as the feature library of objects to be searched; the final feature code of the search target is obtained with the same steps; the cosine distance between the final feature code of the search target and that of each pedestrian in the feature library is computed as the similarity between the search target and each object to be searched, and the re-identification results are given sorted by similarity in descending order.
7. The pedestrian cross-mirror re-identification method based on background and pose normalization according to claim 6, wherein the step S2 specifically comprises:
S21, using the trained target detection model to select, frame by frame, rectangular areas containing pedestrians from the illumination-balanced pictures;
S22, according to the rectangular bounding box information of the potential target, cropping a feature sub-graph of the potential target area from the overall picture features using a trained instance segmentation model containing two graph convolution layers, wherein the loss function used in training the instance segmentation model is the sum of the classification loss, regression loss and center distance loss output by the instance segmentation model with the two graph convolution layers;
S23, extracting the occluder mask through the front graph neural network: the feature sub-graph of the potential target area is input into a conventional convolution layer; each element of the output feature map is taken as a node of a graph and input into one graph convolution layer to extract the association relations between different elements, specifically: the input features are multiplied by a weight matrix, which is the adjacency matrix of the graph; the dimension is changed by a 1 × 1 convolution layer; a nonlinear transformation is applied with a Softmax operator; and finally a ReLU activation is applied to obtain the association relations between elements of different regions; the association relations between different elements are then input into a fully connected layer to obtain the mask information of occluders inside the potential target frame;
S24, extracting the mask of the occluded target through the rear graph neural network; the input of this step is the sum of the potential target area feature sub-graph and the features processed by the front graph neural network, and the output is the mask information of the occluded target;
S25, according to the mask information of the occluder and the occluded target, the pedestrian target is extracted from the original picture and embedded into a rectangular frame with fill color pure white RGB(255, 255, 255) to obtain the background normalized picture set.
8. The pedestrian cross-mirror re-identification method based on background and pose normalization according to claim 6, wherein the 20 key points comprise: top of the head, eyes, mouth, neck, left shoulder, right shoulder, left elbow, right elbow, left hand, right hand, left-hand carried object, right-hand carried object, upper torso center, crotch, left knee, right knee, left ankle, right ankle, left foot and right foot, wherein: taking the top of the head, the neck, the carried objects and the upper torso center as predicted key points allows hairstyle, handbag and upper-body clothing information to be captured, making the set better suited to pedestrian re-identification scenes.
9. The pedestrian cross-mirror re-identification method based on background and pose normalization according to claim 6, wherein the step S3 specifically comprises:
S31, constructing an HRNet without its terminal aggregation layer as the backbone network, and inputting the background normalized picture set to obtain four feature maps of different scales as a temporary feature set;
S32, the global feature aggregation module serving as the first-path post-network converts the 4 feature maps in the temporary feature set into feature maps of size 28 × 28 by convolution, adjusts the channel number of each feature map to 1 with a 1 × 1 convolution layer, and inputs the 4 single-channel 28 × 28 feature maps into a convolution layer with 3 × 3 kernels to obtain a 28 × 28 × 3 feature map as the global feature of the background normalized picture;
S33, the feature super-resolution module in the second-path post-network performs bilinear interpolation on the feature map output by each branch of the temporary feature set, expanding the resolution to 2 times the branch output in both width and height, i.e. expanding the number of elements to 4 times, and then re-fuses the features through a convolution layer with stride 1 to obtain a second temporary feature set;
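The 2× bilinear upsampling at the heart of the feature super-resolution module can be sketched as follows (single-channel map for simplicity; the stride-1 re-fusion convolution is omitted):

```python
import numpy as np

def bilinear_upsample_2x(fmap):
    """2x bilinear upsampling of an H x W feature map: each output pixel
    samples the input grid at half-integer positions and blends the four
    nearest input values."""
    h, w = fmap.shape
    ys = (np.arange(2 * h) + 0.5) / 2 - 0.5       # source row positions
    xs = (np.arange(2 * w) + 0.5) / 2 - 0.5       # source column positions
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 1)
    y1 = np.clip(y0 + 1, 0, h - 1)
    x1 = np.clip(x0 + 1, 0, w - 1)
    wy = np.clip(ys - y0, 0, 1)[:, None]          # vertical blend weights
    wx = np.clip(xs - x0, 0, 1)[None, :]          # horizontal blend weights
    top = fmap[y0][:, x0] * (1 - wx) + fmap[y0][:, x1] * wx
    bot = fmap[y1][:, x0] * (1 - wx) + fmap[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

fmap = np.arange(16, dtype=float).reshape(4, 4)
up = bilinear_upsample_2x(fmap)
print(up.shape)  # (8, 8) -> 4x as many elements
```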
S34, the multi-resolution heat map aggregation module in the second-path post-network merges the channels of the four-way feature maps in the second temporary feature set through a convolution layer containing 20 1 × 1 convolutions and predicts, for each element of the four-way feature maps, the probability of lying near each key point, yielding key point heat maps from the different branches; the heat maps output by the different branches are then upsampled to the input image resolution by bilinear interpolation and averaged to obtain the key point average heat maps;
S35, the key region selection module in the second-path post-network rectifies the average heat map of each key point according to its values, to reduce the influence of misrecognized points on the subsequent weighted feature extraction, specifically:
i) taking 1/20 of the maximum value in the heat map as the reference baseline, values in the heat map below the baseline are set to 0, and values at or above the baseline are kept unchanged;
ii) a square frame is used to select each connected heat block in each heat map, the side length of the square being the larger of the horizontal and vertical extents of the connected block; the sum of all values within each frame is computed, the frame with the maximum sum is kept as the key point region of that heat map, and its center point is taken as the key point coordinate, yielding 20 key points and 20 key region coordinates in total;
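Steps i) and ii) can be sketched as follows (simplified: a single connected block is assumed, so the retained frame is just its bounding square):

```python
import numpy as np

def rectify_and_select(heat):
    # i) reference baseline: 1/20 of the heat map's maximum value;
    # values below the baseline are zeroed, the rest kept unchanged.
    base = heat.max() / 20.0
    rect = np.where(heat >= base, heat, 0.0)
    # ii) bounding square of the retained block (one connected block
    # assumed): side length is the larger of the horizontal and vertical
    # extents, and the block center is taken as the key point coordinate.
    ys, xs = np.nonzero(rect)
    side = max(ys.max() - ys.min(), xs.max() - xs.min()) + 1
    center = (float(ys.mean()), float(xs.mean()))
    return rect, side, center

heat = np.zeros((8, 8))
heat[2:5, 3:5] = 1.0          # toy 3 x 2 heat block
rect, side, center = rectify_and_select(heat)
print(side)  # 3
```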
S36, the heat values in the normalized key point region are multiplied pixel by pixel with the corresponding region of the background-normalized original image to form the 20 key point local features, which together with the global feature form the first feature set.
10. A system for implementing the pedestrian cross-mirror re-identification method based on background and pose normalization according to any one of claims 1 to 9, characterized by comprising: a video preprocessing module, a background normalization module, a key point feature extraction module, a feature coding module and a similarity calculation module, wherein: the video preprocessing module decodes the monitoring video, acquires the monitoring pictures to be processed and applies homomorphic filtering to obtain a uniform-illumination picture set; the background normalization module processes the background information of the pedestrian targets in the monitoring pictures, separates each pedestrian target from the uniform-illumination picture and embeds it into a rectangular frame to obtain a background normalized picture set; the key point feature extraction module processes the background normalized picture set and obtains the key point features and the global feature of the pedestrian target with the two-way pose estimation network, synthesizing different scales to obtain a first feature set; the feature coding module further processes the first feature set, embeds key point position and category information, models the association relations between different key points, and suppresses the interference factors of pose differences and occlusion to obtain a second feature set; the similarity calculation module calculates the similarity between the second feature sets of the recognition target and the objects to be searched to obtain the pedestrian re-identification result.
CN202111394021.2A 2021-11-23 2021-11-23 Pedestrian cross-mirror weight recognition method and system based on background and attitude normalization Pending CN114120363A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111394021.2A CN114120363A (en) 2021-11-23 2021-11-23 Pedestrian cross-mirror weight recognition method and system based on background and attitude normalization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111394021.2A CN114120363A (en) 2021-11-23 2021-11-23 Pedestrian cross-mirror weight recognition method and system based on background and attitude normalization

Publications (1)

Publication Number Publication Date
CN114120363A true CN114120363A (en) 2022-03-01

Family

ID=80440106

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111394021.2A Pending CN114120363A (en) 2021-11-23 2021-11-23 Pedestrian cross-mirror weight recognition method and system based on background and attitude normalization

Country Status (1)

Country Link
CN (1) CN114120363A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114581864A (en) * 2022-03-04 2022-06-03 哈尔滨工程大学 Transformer-based dynamic dense alignment vehicle weight identification technology
CN114581864B (en) * 2022-03-04 2023-04-18 哈尔滨工程大学 Transformer-based dynamic dense alignment vehicle weight identification technology
CN115357742A (en) * 2022-08-02 2022-11-18 广州市玄武无线科技股份有限公司 Store image duplicate checking method, system, terminal device and storage medium
CN115356740A (en) * 2022-08-09 2022-11-18 群周科技(上海)有限公司 Landing positioning method for landing area in airborne environment
CN115830637A (en) * 2022-12-13 2023-03-21 杭州电子科技大学 Method for re-identifying shielded pedestrian based on attitude estimation and background suppression
CN115830637B (en) * 2022-12-13 2023-06-23 杭州电子科技大学 Method for re-identifying blocked pedestrians based on attitude estimation and background suppression
US11908222B1 (en) 2022-12-13 2024-02-20 Hangzhou Dianzi University Occluded pedestrian re-identification method based on pose estimation and background suppression
CN115641559A (en) * 2022-12-23 2023-01-24 深圳佑驾创新科技有限公司 Target matching method and device for panoramic camera group and storage medium
CN117388893A (en) * 2023-12-11 2024-01-12 深圳市移联通信技术有限责任公司 Multi-device positioning system based on GPS
CN117388893B (en) * 2023-12-11 2024-03-12 深圳市移联通信技术有限责任公司 Multi-device positioning system based on GPS
CN117975574A (en) * 2024-04-02 2024-05-03 泉州装备制造研究所 Single-stage identification method and device for human body key point regression

Similar Documents

Publication Publication Date Title
CN114120363A (en) Pedestrian cross-mirror weight recognition method and system based on background and attitude normalization
CN104915636B (en) Remote sensing image road recognition methods based on multistage frame significant characteristics
CN111259850A (en) Pedestrian re-identification method integrating random batch mask and multi-scale representation learning
CN110532970B (en) Age and gender attribute analysis method, system, equipment and medium for 2D images of human faces
CN109146831A (en) Remote sensing image fusion method and system based on double branch deep learning networks
CN111046821B (en) Video behavior recognition method and system and electronic equipment
CN110929593A (en) Real-time significance pedestrian detection method based on detail distinguishing and distinguishing
CN112258559B (en) Intelligent running timing scoring system and method based on multi-target tracking
CN115240121B (en) Joint modeling method and device for enhancing local features of pedestrians
WO2023159898A1 (en) Action recognition system, method, and apparatus, model training method and apparatus, computer device, and computer readable storage medium
CN113065645A (en) Twin attention network, image processing method and device
CN112861970A (en) Fine-grained image classification method based on feature fusion
CN111291612A (en) Pedestrian re-identification method and device based on multi-person multi-camera tracking
CN114627269A (en) Virtual reality security protection monitoring platform based on degree of depth learning target detection
Malav et al. DHSGAN: An end to end dehazing network for fog and smoke
CN112418203B (en) Robustness RGB-T tracking method based on bilinear convergence four-stream network
CN112668493B (en) Reloading pedestrian re-identification, positioning and tracking system based on GAN and deep learning
CN116824641B (en) Gesture classification method, device, equipment and computer storage medium
CN116993760A (en) Gesture segmentation method, system, device and medium based on graph convolution and attention mechanism
Zhu Image quality assessment model based on multi-feature fusion of energy Internet of Things
Verma et al. Intensifying security with smart video surveillance
CN114820567B (en) Insulator detection method based on deep learning
Luo et al. Infrared and visible image fusion algorithm based on improved residual Swin Transformer and Sobel operators
CN116597367A (en) Pedestrian re-identification method for monitoring video
Zhao et al. Research on Re-recognition Method of Multi-branch Fusion Attention Mechanism for Occluded Pedestrian

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination