CN114973305A - Accurate human body analysis method for crowded people - Google Patents
- Publication number
- CN114973305A (application CN202111645897.XA)
- Authority
- CN
- China
- Prior art keywords
- human body
- accurate
- joint
- image
- result
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000004458 analytical method Methods 0.000 title claims abstract description 74
- 238000001514 detection method Methods 0.000 claims abstract description 33
- 238000010586 diagram Methods 0.000 claims abstract description 7
- 238000005457 optimization Methods 0.000 claims abstract description 6
- 238000007670 refining Methods 0.000 claims abstract description 6
- 230000036544 posture Effects 0.000 claims description 43
- 238000000034 method Methods 0.000 claims description 27
- 238000013507 mapping Methods 0.000 claims description 20
- 238000013528 artificial neural network Methods 0.000 claims description 9
- 230000009466 transformation Effects 0.000 claims description 9
- 238000011176 pooling Methods 0.000 claims description 8
- 239000011159 matrix material Substances 0.000 claims description 7
- 239000013598 vector Substances 0.000 claims description 6
- 238000004364 calculation method Methods 0.000 claims description 5
- 230000004927 fusion Effects 0.000 claims description 5
- 238000005192 partition Methods 0.000 claims description 5
- 230000008569 process Effects 0.000 claims description 5
- 230000008030 elimination Effects 0.000 claims description 4
- 238000003379 elimination reaction Methods 0.000 claims description 4
- 230000002452 interceptive effect Effects 0.000 claims description 4
- 238000005070 sampling Methods 0.000 claims description 4
- 230000004580 weight loss Effects 0.000 claims description 4
- 230000004913 activation Effects 0.000 claims description 3
- 238000002372 labelling Methods 0.000 claims description 3
- 230000006870 function Effects 0.000 description 21
- 210000001503 joint Anatomy 0.000 description 18
- 238000012549 training Methods 0.000 description 8
- 230000003993 interaction Effects 0.000 description 4
- 238000012360 testing method Methods 0.000 description 4
- 230000008859 change Effects 0.000 description 2
- 230000000694 effects Effects 0.000 description 2
- 238000000605 extraction Methods 0.000 description 2
- 238000007667 floating Methods 0.000 description 2
- 230000011218 segmentation Effects 0.000 description 2
- 239000013589 supplement Substances 0.000 description 2
- 230000004931 aggregating effect Effects 0.000 description 1
- 210000000544 articulatio talocruralis Anatomy 0.000 description 1
- 230000006399 behavior Effects 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000004422 calculation algorithm Methods 0.000 description 1
- 238000012512 characterization method Methods 0.000 description 1
- 238000013527 convolutional neural network Methods 0.000 description 1
- 238000013136 deep learning model Methods 0.000 description 1
- 238000003708 edge detection Methods 0.000 description 1
- 210000002310 elbow joint Anatomy 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 239000000284 extract Substances 0.000 description 1
- 210000004394 hip joint Anatomy 0.000 description 1
- 210000000629 knee joint Anatomy 0.000 description 1
- 230000004807 localization Effects 0.000 description 1
- 210000003141 lower extremity Anatomy 0.000 description 1
- 230000001537 neural effect Effects 0.000 description 1
- 238000010606 normalization Methods 0.000 description 1
- 238000003909 pattern recognition Methods 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 238000009877 rendering Methods 0.000 description 1
- 210000000323 shoulder joint Anatomy 0.000 description 1
- 210000001364 upper extremity Anatomy 0.000 description 1
- 238000012795 verification Methods 0.000 description 1
- 230000000007 visual effect Effects 0.000 description 1
- 230000003313 weakening effect Effects 0.000 description 1
- 210000003857 wrist joint Anatomy 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to an accurate human body analysis method for crowded crowds, belonging to the field of computer vision and image applications. First, a crowded crowd image set is input, coarse image hierarchical features and superpixel features are extracted through a deep residual network, and feature representation is performed on the human body images to obtain foreground accurate semantic feature maps and generate human body candidate multi-region detection frames. Second, the foreground accurate semantic feature maps are sampled to the same size and fused to generate high-resolution features, and a coarse human body analysis result is obtained through preliminary analysis. Then, human body pose estimation is performed on the foreground accurate semantic features within the candidate multi-region detection frames to generate human body joint points, which are refined into a multi-person accurate pose estimation result. Finally, the coarse human body analysis result and the multi-person accurate pose estimation result are jointly optimized by computing a semantic distance loss, and the final accurate human body analysis result is output. The invention can effectively analyze human body images in crowded crowds.
Description
Technical Field
The invention relates to an accurate human body analysis method for crowds, and belongs to the field of computer vision and image application.
Background
Human body analysis is a fine-grained semantic segmentation task that aims to identify the components of a human body image at the pixel level, such as body parts and clothes. It is a basic task in multimedia and computer vision and has good potential for problems in various visual scenes, such as behavior analysis, video image understanding, and intelligent security. Known methods account for semantic features of different sizes, using structures such as FCN, DeepLabV1, DeepLabV3, SegNet, and ASPP to improve human body analysis by extracting multi-scale semantic features. However, considering only multi-scale information does not capture deep relationships between pixels, and the complex interactions between human instances in crowded scenes cannot be modeled well. Technically, several key problems of human body analysis for crowds remain unsolved, mainly in three respects: 1) the background is complex, and background colors can be too similar to people's clothes; 2) the number of human instances varies greatly and their motion postures are diverse; people in complex motion environments interact strongly, making feature attribution difficult; 3) crowded environments contain complex occlusions, including self-occlusion, occlusion between people and objects, and mutual occlusion between human instances, and these occlusions strongly affect the accuracy of human body analysis. These three respects are the key problems that human body analysis in crowds urgently needs to solve.
Known human body analysis methods are mainly based on feature enhancement, multi-task learning, and the like. For example, Zhang X (Neurocomputing 402, 2020, 375-383) proposes a Semantic Spatial Fusion Network (SSFNet) for human body parsing that narrows the semantic gap and gives accurate high-resolution predictions by aggregating multi-resolution features. Zhang Z (IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, 8897-). However, although these known methods supplement human body analysis with multi-scale semantics and other tasks, achieve good results on single-person parsing, and can be extended to multi-person situations by combining a target detection algorithm, they depend too much on the accuracy of the target detection method, do not consider relationships between different human instances, and struggle to produce good results on crowds. Patent CN113111848A adds a dilated convolution to each feature layer between the encoder and decoder to perform multi-scale feature fusion and enhance the model's feature extraction capability, addressing the insufficient pixel precision of traditional human body analysis methods at human body edges. That method, however, is merely a simply stacked dilated-convolution structure with a large amount of redundant computation; solving only the edge-precision problem is not well suited to the human body analysis task, and it only adds a dilated convolution layer on the last feature layer. Adding superpixel features that characterize the internal image structure in the middle of the codec, by contrast, makes it possible to obtain a preliminary human body structure while obtaining precise edges. Patent CN113537072A performs the pose estimation and human body analysis tasks separately in a joint-learning manner, sharing multi-scale features extracted from the backbone after non-local processing.
Although it accounts for the commonalities between the pose and parsing tasks, it ignores the disparity between the two tasks, and the method is applicable only to the single-person parsing case.
Disclosure of Invention
The invention provides an accurate human body analysis method for crowded people, which is used for effectively analyzing images of the crowded people to obtain an accurate human body analysis result, thereby meeting the current accuracy requirement on the crowded people analysis.
The technical scheme of the invention is as follows: a method for accurate human body analysis of crowds. First, a crowded crowd image set is input, coarse image hierarchical features and superpixel features are extracted through a deep residual network, and feature representation is performed on the human body images to obtain foreground accurate semantic feature maps and generate human body candidate multi-region detection frames. Second, the foreground accurate semantic feature maps are sampled to the same size and fused to generate high-resolution features, and a coarse human body analysis result is obtained through preliminary analysis. Then, human body pose estimation is performed on the foreground accurate semantic features within the candidate multi-region detection frames to generate human body joint points, which are refined into a multi-person accurate pose estimation result. Finally, the coarse human body analysis result and the multi-person accurate pose estimation result are jointly optimized by computing a semantic distance loss, and the final accurate human body analysis result is output.
The method comprises the following specific steps:
Step1: input the crowded crowd image set G = {G_1, G_2, ..., G_n}, extract coarse image hierarchical features and superpixel features through the deep residual network, perform feature representation on the human body images to obtain foreground accurate semantic feature maps, and generate human body candidate multi-region detection frames;
Step2: sample the foreground accurate semantic feature maps of different scales to the same size by bilinear interpolation, fuse them to generate high-resolution features, and obtain a coarse human body analysis result through preliminary analysis.
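The sample-to-common-size-and-fuse operation of Step2 can be sketched with plain numpy (a minimal illustration; function names, output size, and channel counts are assumptions, not from the patent):

```python
import numpy as np

def bilinear_resize(feat, out_h, out_w):
    """Bilinearly resample a (H, W, C) feature map to (out_h, out_w, C)."""
    h, w, c = feat.shape
    # Sample positions in the source map (align_corners-style grid).
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]   # vertical interpolation weights
    wx = (xs - x0)[None, :, None]   # horizontal interpolation weights
    top = feat[y0][:, x0] * (1 - wx) + feat[y0][:, x1] * wx
    bot = feat[y1][:, x0] * (1 - wx) + feat[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def fuse_multiscale(features, out_hw=(64, 64)):
    """Upsample every scale to one size, then fuse by channel concatenation."""
    resized = [bilinear_resize(f, *out_hw) for f in features]
    return np.concatenate(resized, axis=-1)
```

A deep model would follow the concatenation with learned convolutions; here the fusion is kept to the resampling step the text describes.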
Step3, carrying out human body posture estimation on the foreground accurate semantic features in the candidate multi-region detection frame, defining a joint loss function to inhibit interference joints, generating human body joint points, defining a human body posture association rule, carrying out association connection on all the generated joint points, and refining to obtain a multi-person accurate posture estimation result;
and Step4, performing joint optimization on the obtained human body coarse analysis result and the multi-person accurate posture estimation result by calculating the semantic distance loss, and outputting a final accurate human body analysis result.
Step1 proceeds concretely as follows:
First, hierarchical features P = {P_1, P_2, P_3, P_4, P_5} are extracted from the input crowd image set G using ResNet-101, and a series of superpixel partitions S = {S_0, S_1, ..., S_N} is generated using convolution-oriented boundaries (COB), where S_N is the superpixel representing the entire image and each S_k is a merge of the superpixels in S_{k-1}. Matching the sizes of P_2, P_4, and P_5, a subset N = {N_2, N_4, N_5} is selected from S, in which the number of nodes between adjacent levels differs by a factor of 1/4.
Then, feature mapping is performed on P_2, P_4, and P_5 to map them onto a graph matrix, where W is the learnable weight matrix of a fully connected layer, || denotes the concatenation operation, Δ_min(P_l^e) and Δ_max(P_l^e) denote minimum pooling and maximum pooling respectively, and P_l^e denotes the grid cells that jointly correspond to a hierarchical superpixel block.
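Assuming the mapping projects the concatenated min/max-pooled cell features through the fully connected weight W (the formula itself is not reproduced in the text, so this form is an assumption), a numpy sketch of such a superpixel-to-node mapping might look like:

```python
import numpy as np

def map_to_graph(feat, labels, w):
    """Map grid features to superpixel graph nodes.

    feat:   (H, W, C) feature map
    labels: (H, W) integer superpixel assignment per grid cell
    w:      (2*C, D) weight of a fully connected layer (hypothetical shape)
    Each node concatenates min- and max-pooled features over its cells,
    then projects through w.
    """
    n_nodes = labels.max() + 1
    nodes = np.zeros((n_nodes, w.shape[1]))
    for s in range(n_nodes):
        cells = feat[labels == s]                              # (n_cells, C)
        pooled = np.concatenate([cells.min(0), cells.max(0)])  # (2C,)
        nodes[s] = pooled @ w
    return nodes
```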
Context and hierarchical information of the mapped features is then extracted by a graph neural network and fused with the feature-pyramid decoding features; spatial and channel attention are added inside the graph neural network to reduce redundant computation, yielding the final feature representation result. Given a mapping node i and its set of neighboring nodes C_i, the spatial attention of node i aggregates the feature vectors collected from the neighbors of i over M self-attention heads. The channel attention gates the features using the mean of the feature vectors of node i and its neighbors, where σ denotes a fully connected layer with Sigmoid activation and ⊙ denotes element-wise multiplication. The final attention combines the two with a scale weight β initialized to 0.
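As one consistent reading of the channel-attention description (the exact formulas are not reproduced in the text, so the gating form below is an assumption), a numpy sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(node_feats, adj, w_fc):
    """Channel attention per graph node: a Sigmoid-activated fully
    connected layer applied to the mean feature of a node and its
    neighbours, multiplied element-wise onto the node feature.

    node_feats: (N, C) node features
    adj:        (N, N) adjacency matrix (nonzero = neighbour)
    w_fc:       (C, C) fully connected weight (hypothetical shape)
    """
    out = np.empty_like(node_feats)
    for i in range(node_feats.shape[0]):
        group = np.concatenate([[i], np.flatnonzero(adj[i])])
        mean = node_feats[group].mean(axis=0)
        gate = sigmoid(mean @ w_fc)        # per-channel gate in (0, 1)
        out[i] = node_feats[i] * gate
    return out
```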
Finally, the foreground accurate semantic features F_f = {P'_u | 1 ≤ u ≤ 5} are obtained based on the feature representation result. The feature representation result is input into the hierarchical cascade RPN to obtain candidate regions, and the human body candidate multi-region detection frames D = {D_v} are generated through classification and regression prediction, where v represents the number of people in the image.
Step3 proceeds concretely as follows:
first, affine transformation is performed on all the detected human body candidate multi-region detection frames in Step 1.
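The per-box affine transformation, and the inverse later used to restore original-image coordinates, can be sketched as follows (the box format and output size are assumptions, not from the patent):

```python
import numpy as np

def box_affine(box, out_hw=(256, 192)):
    """Affine matrix mapping a detection box (x0, y0, x1, y1) onto a
    fixed-size input for the single-pose module, plus its inverse."""
    x0, y0, x1, y1 = box
    oh, ow = out_hw
    sx, sy = ow / (x1 - x0), oh / (y1 - y0)
    m = np.array([[sx, 0.0, -sx * x0],
                  [0.0, sy, -sy * y0],
                  [0.0, 0.0, 1.0]])
    return m, np.linalg.inv(m)

def apply_affine(m, pts):
    """Apply a 3x3 affine matrix to (N, 2) points."""
    homo = np.hstack([pts, np.ones((len(pts), 1))])
    return (homo @ m.T)[:, :2]
```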
Then, each transformed human body is input into a single-pose estimation module to generate joint heat maps, and two joint types are defined: target joints and interference joints. A loss is defined to suppress the interfering joints, where RMSE is the root-mean-square error function, k denotes the k-th joint of the v-th person, and K is the total number of joints.
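A hedged sketch of Gaussian joint heat maps and an RMSE-style loss that drives interference joints toward zero (the patent's exact loss is not reproduced; this is one plausible form):

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma=2.0):
    """2-D Gaussian heat map peaked at joint location (cx, cy)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def joint_loss(pred, target_joints, interference_joints):
    """RMSE between predicted heat maps and targets: target joints are
    regressed to Gaussian peaks, interference joints to all-zero maps,
    which suppresses them.

    pred:                (K, H, W) predicted heat maps
    target_joints:       {joint index: (cx, cy)} ground-truth locations
    interference_joints: indices whose target is the zero map
    """
    target = np.zeros_like(pred)
    for k, (cx, cy) in target_joints.items():
        target[k] = gaussian_heatmap(*pred.shape[1:], cx, cy)
    for k in interference_joints:
        target[k] = 0.0
    return float(np.sqrt(np.mean((pred - target) ** 2)))
```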
Next, a human body pose association rule is defined and all generated joint points are associatively connected to produce a skeleton annotation graph A = {A_L}, where L denotes the number of generated connections. A pose similarity function f(A_x, A_y, η) = 1[d(A_x, A_y | Λ, λ) ≤ η] is defined to eliminate similar poses, where d(·) is the defined pose distance function, Λ and λ are parameter sets, and η is a distance threshold. If d ≤ η, A_x is judged redundant and eliminated. Concretely, d(A_x, A_y | Λ) = Q(A_x, A_y | η_1) + μH(A_x, A_y | η_2): the function Q computes the confidence of matched joint points between pose skeletons to obtain the number of matched joints, the function H computes the spatial distance between joints, η_1 and η_2 are tunable parameters, μ is a weight balancing the two distances, and Λ = {η_1, η_2, μ}.
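The elimination rule d = Q + μH with threshold η can be sketched as a greedy pose de-duplication pass; the concrete definitions of Q and H below are stand-ins, since the text does not spell them out:

```python
import numpy as np

def pose_distance(ax, ay, conf_x, conf_y, eta1=0.1, eta2=5.0, mu=0.5):
    """d(Ax, Ay | Lambda) = Q + mu * H with stand-in components:
    Q: one minus the fraction of joints whose confidences both pass eta1
       (0 when every joint pair is confidently matched);
    H: mean Euclidean distance between corresponding joints, scaled by eta2.
    ax, ay: (K, 2) joint coordinates; conf_x, conf_y: (K,) confidences."""
    matched = (conf_x > eta1) & (conf_y > eta1)
    q = 1.0 - matched.mean()
    h = np.linalg.norm(ax - ay, axis=1).mean() / eta2
    return q + mu * h

def eliminate_redundant(poses, confs, eta=0.3):
    """Greedy elimination: pose i is kept only if its distance to every
    already-kept pose exceeds eta (d <= eta means redundant)."""
    kept = []
    for i, (p, c) in enumerate(zip(poses, confs)):
        if all(pose_distance(p, poses[j], c, confs[j]) > eta for j in kept):
            kept.append(i)
    return kept
```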
Finally, the original-image coordinates are restored through inverse affine transformation, and refinement produces the accurate pose estimation result F_pose.
The specific process of Step4 is as follows:
First, the accurate pose estimation result F_pose is concatenated with the coarse analysis result F_parsing obtained in Step2, and the labels of different human body instances are segmented through the human body multi-region candidate frames and pose constraints.
Then a semantic spatial distance loss L_f_parsing = λ_1·L_parsing + L_parsing·L_pose + λ_2·L_pose is defined to reduce the gap between different semantics, where λ_1 is the parsing loss weight, λ_2 is the pose loss weight, L_parsing denotes the coarse parsing loss, and L_pose denotes the pose estimation loss. In these losses, a denotes the total number of pixels in the image; m the number of label categories; 1 the indicator in the cross-entropy loss; y_i the category of the i-th pixel; ln(f_m) the log-probability of the prediction being the m-th semantic class; n the total number of joint points; and the remaining symbols the pixel coordinates and ground-truth coordinates of the n-th joint in the image.
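The combined objective λ1·L_parsing + L_parsing·L_pose + λ2·L_pose can be computed as follows; the two component losses are simplified stand-ins for the cross-entropy and joint-coordinate terms described in the text:

```python
import numpy as np

def parsing_loss(probs, labels):
    """Pixel-wise cross-entropy: -mean log p(true class).
    probs: (a, m) predicted class probabilities; labels: (a,) classes."""
    a = labels.size
    return float(-np.log(probs[np.arange(a), labels]).mean())

def pose_loss(pred_xy, true_xy):
    """Mean squared joint-coordinate error over (n, 2) coordinates."""
    return float(np.mean((pred_xy - true_xy) ** 2))

def semantic_distance_loss(probs, labels, pred_xy, true_xy,
                           lam1=1.0, lam2=0.5):
    """L = lam1 * L_parsing + L_parsing * L_pose + lam2 * L_pose,
    the joint objective given in the text."""
    lp = parsing_loss(probs, labels)
    lq = pose_loss(pred_xy, true_xy)
    return lam1 * lp + lp * lq + lam2 * lq
```

The product term couples the two tasks: a pose error is penalized more when the parsing is also poor, which matches the text's intent of joint optimization.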
Finally, the concatenated result is mapped by convolution to size N×C, the parsing result is mapped to size C×N, the two are fused and input into three 7×7 convolutional layers of dimension 128 for refinement, giving the final accurate parsing result.
The invention has the beneficial effects that:
1. Known methods use only neural networks with fixed topology for spatial and scale feature interaction in multi-scale feature learning, ignoring the internal structure of the image, so basic feature information is lost or weakened during propagation and interaction. The present method maps superpixels onto graph nodes to inherit the image's intrinsic hierarchical structure and fuses different levels with intermediate subsets of feature extraction; this not only yields enhanced multi-scale features but also better integrates the intrinsic structure of the image, providing finer-grained feature representation for subsequent target detection and segmentation tasks and improving the accuracy of subsequent human body analysis.
2. Images of crowded scenes suffer from complex backgrounds and postures, large variation in the number of human instances, mutual occlusion, and related problems. Most known human body analysis methods only consider multi-scale pixel precision to improve analysis precision, so models degrade severely in crowded scenes; complex postures and occluded environments are not considered, making high accuracy difficult to obtain. The invention provides top-down multi-person pose estimation, segments different human bodies through accurate target detection, models occlusion and complex postures, combines the bottom-up human body semantic information obtained from human body analysis, and designs a semantic spatial distance loss, so crowded crowd images are analyzed accurately with higher precision.
3. The invention extracts common features through an improved ResNet network, performs human body analysis and pose estimation separately, and maps the pose features into the human body analysis as a supplement, effectively addressing the problems in crowded human body analysis. Superpixel features are added to the feature representation, pixel relationships are established with graph convolution to obtain accurate foreground features, and crowd pose estimation and coarse human body analysis are performed from the same foreground features. The accurate joint-point features of crowded postures and the human body part features in the analysis complement each other, and a joint optimization module is designed to obtain an accurate human body analysis result with high precision.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a detailed flow chart of the features of the present invention.
FIG. 3 is a flowchart illustrating a rough analysis method according to the present invention.
FIG. 4 is a diagram of an example of an accurate pose estimation according to the present invention.
FIG. 5 is a diagram of an example of an accurate analysis result according to the present invention.
FIG. 6 is a flowchart illustrating an embodiment of the present invention.
Detailed Description
Example 1: as shown in fig. 1 to 6, a method for accurate human body analysis for crowded people includes the following steps:
step1, inputting a crowded crowd image set, extracting a coarse image layering characteristic and a super-pixel characteristic through a depth residual error network, performing characteristic representation on a human body image to obtain a foreground accurate semantic characteristic diagram, and generating a human body candidate multi-region detection frame;
step2, sampling the foreground accurate semantic feature map to the same size, fusing the foreground accurate semantic feature map and the foreground accurate semantic feature map to generate high-resolution features, and obtaining a human body coarse analysis result through preliminary analysis;
step3, carrying out human body posture estimation on the foreground accurate semantic features in the candidate multi-region detection frame to generate human body joint points, and refining to obtain a multi-person accurate posture estimation result;
and Step4, performing joint optimization on the obtained human body coarse analysis result and the multi-person accurate posture estimation result by calculating the semantic distance loss, and outputting a final accurate human body analysis result.
Example 2: a precise human body analysis method aiming at crowds comprises the following specific steps:
Step1: as shown in FIG. 2, image (a) is the input original image. Hierarchical features are extracted from the input crowded crowd image (a) using ResNet-101 (see (b)), and a series of superpixel partitions (e) S = {S_0, S_1, ..., S_N} is generated using convolution-oriented boundaries (COB), where S_N is the superpixel representing the entire image and each S_k is a merge of the superpixels in S_{k-1}. Matching the sizes of P_2, P_4, and P_5, a subset N = {N_2, N_4, N_5} is selected from S, with the number of nodes between adjacent levels set to a 1/4 ratio.
Then feature mapping is performed on P_2, P_4, and P_5: specifically, P_2, P_4, and P_5 are first mapped onto the l-th rectangular grid, and the grid cells are then assigned to the superpixel grid-cell set P_l^e, where each grid cell corresponds to a small rectangular area of the input image (a). In the mapping formula, W is the learnable weight matrix of a fully connected layer, || denotes the concatenation operation, and Δ_min(P_l^e) and Δ_max(P_l^e) denote minimum pooling and maximum pooling, respectively.
Context and hierarchy information of the mapped features is then extracted by a graph neural network (see (d)). The three-layer graph neural network extracts context information, hierarchy information, and context information respectively; each layer has its own learning parameters, not shared with the other layers. To reduce redundant computation, spatial and channel attention are added in the graph neural network, and the combined feature-pyramid decoding features are fused to obtain the final feature representation result (see (c)). Given a mapping node i and its set of neighboring nodes C_i, the spatial attention of node i aggregates the feature vectors collected from the neighbors of i over M self-attention heads; the channel attention gates the features using the mean of the feature vectors of the node and its neighbors, where σ denotes a fully connected layer with Sigmoid activation and ⊙ denotes element-wise multiplication. The final attention combines the two with a scale weight β initialized to 0.
Finally, based on the feature representation result (c), the foreground accurate semantic features F_f = {P'_u | 1 ≤ u ≤ 5} are obtained; the labeled example is shown in fig. (g). The feature representation result is input into a hierarchical cascade RPN to obtain candidate regions, and human body candidate multi-region detection boxes D_v are generated through classification and regression prediction, where v denotes the number of people in the image, see fig. (f).
The specific flowchart of Step1 is shown in fig. 2. After Step1, the foreground accurate semantic features (g) and the human body candidate multi-region detection boxes (f) are obtained. The data set is selected from general human parsing data sets such as CIHP and Multi-Human Parsing v2.0, which mainly consist of multi-person images; each image contains more than 3 persons on average, and there are about 60,000 images in total, divided into a training set of 43,683 images, a test set of 10,000 images, and a validation set of 10,000 images. In this example, PyTorch is used to run the experiment with crowded-crowd images as input. The model is trained in the first stage, and the training parameters are continuously adjusted so that the model obtains better foreground semantic features and detection-box precision. The first-stage quantitative comparison is shown in Table 1, where this example is compared with other typical deep-learning target-detection models among known methods, such as Fast RCNN, Fast RCNN+FPN and YOLOV3. Params is the number of learnable parameters of the model, GFLOPS is the number of floating-point operations, Test Speed is the detection speed, and AP bbox@0.5IOU is the detection-box average precision. The results show that although the computational cost is comparatively large, higher accuracy is obtained.
TABLE 1
Method | Params | GFLOPS | Test Speed/ms | AP bbox @0.5IOU |
fast RCNN | 34.6M | 172.3 | 13.9 | 65.6% |
fast RCNN+FPN | 64.1M | 240.6 | 5.1 | 68.3% |
YOLOV3 | 239M | - | 8.12 | 71.7% |
The invention | 113M | 387.6 | 25.5 | 73.1% |
Step2, as shown in FIG. 3, the Step1 foreground accurate semantic features (a) are used as input to generate multi-scale features: P'_2 and P'_4 are up-sampled to the scale of P'_5 by bilinear interpolation, and 1×1 convolutions are applied separately for feature alignment to obtain P''_2 and P''_4, aligning them in the same semantic space; the ASPP structure shown in (c) is used to extract the multi-scale feature P''_5 from P'_5. P''_2, P''_4 and P''_5 are then concatenated and passed through a 1×1 convolution layer θ to predict all human semantic regions, yielding the coarse parsing result F_parsing = θ(P''_2 ‖ P''_4 ‖ P''_5), see fig. 3(b).
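A minimal sketch of the Step2 multi-scale alignment follows; nearest-neighbour upsampling stands in for the bilinear interpolation of the description, and all tensor sizes and weights are illustrative:

```python
import numpy as np

def upsample(x, factor):
    """Nearest-neighbour upsampling as a stand-in for bilinear interpolation;
    x has shape (C, H, W) and each pixel is tiled factor x factor times."""
    return np.kron(x, np.ones((1, factor, factor)))

def conv1x1(x, W):
    """A 1x1 convolution is a per-pixel linear map: (C_in, H, W) -> (C_out, H, W)."""
    c, h, w = x.shape
    return (W @ x.reshape(c, -1)).reshape(W.shape[0], h, w)

rng = np.random.default_rng(2)
p2 = rng.normal(size=(8, 4, 4))    # pyramid levels at illustrative sizes
p4 = rng.normal(size=(8, 8, 8))
p5 = rng.normal(size=(8, 16, 16))  # target scale
W = rng.normal(size=(8, 8))        # 1x1 conv weights aligning the semantic space
aligned = [conv1x1(upsample(p2, 4), W), conv1x1(upsample(p4, 2), W), p5]
fused = np.concatenate(aligned, axis=0)  # channel-wise fusion before the 1x1 prediction layer
print(fused.shape)  # (24, 16, 16)
```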
Step3, affine transformation is first performed on all the human body candidate multi-region detection boxes in D obtained in Step1.
Wherein β1, β2 and β3 are parameter vectors, and the transformation maps the coordinates before transformation to the coordinates after transformation.
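The box warp can be sketched in homogeneous coordinates, with the rows of `beta` playing the role of the parameter vectors β1, β2, β3 (an illustrative parameterisation, not the patent's exact one):

```python
import numpy as np

def affine_transform(points, beta):
    """Apply a 2x3 affine transform to corner points:
    [x', y']^T = beta @ [x, y, 1]^T."""
    pts = np.hstack([points, np.ones((len(points), 1))])  # homogeneous coordinates
    return pts @ beta.T

beta = np.array([[2.0, 0.0, 1.0],    # scale x by 2, shift x by +1
                 [0.0, 2.0, -1.0]])  # scale y by 2, shift y by -1
corners = np.array([[0.0, 0.0], [4.0, 3.0]])  # top-left and bottom-right of a box
warped = affine_transform(corners, beta)
print(warped)  # [[ 1. -1.] [ 9.  5.]]
```

The inverse transform (used at the end of Step3 to restore original image coordinates) is the same operation with the inverted parameter matrix.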
Then, as shown in Table 2, 18 joint points are defined over the body parts of the human body, and each transformed human body is input into the single-pose estimation module, generating joint heat maps as shown in fig. 4(b). Two joint types are then defined, namely target joints and interference joints. A loss function based on RMSE (the root-mean-square error function) is defined to suppress the interference joints, where k denotes the k-th joint of the v-th person and K is the total number of joints. During training, joints that do not belong to the human instance in each detection box are suppressed using the defined loss function, reducing mis-connections.
TABLE 2
Body part | Joint points |
Head | Head joint, neck joint, left-eye joint, right-eye joint |
Upper limbs | Left and right elbow joints, left and right wrist joints |
Lower limbs | Left and right ankle joints, left and right knee joints |
Trunk | Left and right shoulder joints, left and right hip joints, pelvic joint, spinal joint |
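A sketch of the interference-joint suppression described in Step3 follows; regressing interference joints toward zero heatmaps is one plausible reading of the defined RMSE loss, and the equal per-joint weighting is an assumption:

```python
import numpy as np

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

def joint_suppression_loss(heatmaps, target_heatmaps, is_target):
    """For the K joints of one detection box, target joints are regressed to
    their ground-truth heatmaps while interference joints (joints of other
    people falling inside the box) are regressed to zero, suppressing
    spurious responses that would cause mis-connections."""
    loss = 0.0
    for k in range(len(heatmaps)):
        ref = target_heatmaps[k] if is_target[k] else np.zeros_like(heatmaps[k])
        loss += rmse(heatmaps[k], ref)
    return loss / len(heatmaps)

hm = np.ones((2, 4, 4))  # two predicted joint heatmaps
gt = np.ones((2, 4, 4))
# joint 0 belongs to this person; joint 1 is an interference joint
loss = joint_suppression_loss(hm, gt, is_target=[True, False])
print(loss)  # 0.5: the target joint fits perfectly, the interference joint is penalised
```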
Then a human body pose association rule is defined, all generated joint points are associatively connected, and a skeleton labeling set A = {A_L} is generated, where L denotes the number of generated connections. A pose similarity function is defined to eliminate similar poses: f(A_x, A_y, η) = 1[d(A_x, A_y | Λ, λ) ≤ η], where d(·) is a defined pose distance function, Λ, λ are parameter sets, and η is a distance threshold. If d ≤ η, A_x is judged redundant and eliminated. Specifically, d(A_x, A_y | Λ) = Q(A_x, A_y | η_1) + μH(A_x, A_y | η_2): the function Q computes the confidence of matched joint points between pose skeletons to obtain the number of matched joints between poses; the function H computes the spatial distance between joints; η_1 and η_2 are tunable parameters; μ is a weight balancing the two distances; and Λ = {η_1, η_2, μ}. The specific formula is as follows:
wherein the three quantities in the formula denote, respectively, the center of the detection box, the position of the n-th joint, and the confidence score of the position of the n-th joint.
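The pose-elimination rule built from Q, H and the threshold η can be sketched as follows; the soft forms of Q and H and all threshold values are illustrative assumptions:

```python
import numpy as np

def pose_distance(joints_x, joints_y, conf_y, eta1=0.3, eta2=1.0, mu=0.5):
    """Sketch of d = Q + mu * H: Q accumulates the confidence of joints of
    pose y lying within eta1 of the matching joint of pose x (a soft count of
    matched joints), and H sums normalised spatial proximities between joints."""
    dist = np.linalg.norm(joints_x - joints_y, axis=1)
    q = float(np.sum(conf_y[dist <= eta1]))  # matched-joint confidence term
    h = float(np.sum(np.exp(-dist / eta2)))  # spatial proximity term
    return q + mu * h

def is_redundant(joints_x, joints_y, conf_y, eta=5.0):
    """A pose is eliminated when its similarity to an already kept pose
    crosses the threshold; the sign convention here is illustrative."""
    return pose_distance(joints_x, joints_y, conf_y) >= eta

pose_a = np.zeros((18, 2))  # 18 joints per pose, two identical skeletons
pose_b = np.zeros((18, 2))
conf = np.ones(18)
print(is_redundant(pose_a, pose_b, conf))  # True: a duplicate pose is removed
```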
Finally, the original image coordinates are restored through inverse affine transformation, and the accurate pose estimation result F_pose is generated through refinement, as shown in fig. 4(c).
As shown in Tables 1 and 3 and fig. 4, after Step3 the accurate pose estimation result F_pose is obtained. Taking the human body candidate multi-region detection boxes generated by Step1 as input, the second training stage is performed, and the optimal training parameters of both stages are kept to facilitate fine-tuning during the last training stage. The invention obtains a detection result with higher precision, see Table 1. Meanwhile, based on the detection boxes, human joint heat maps are obtained, as shown in fig. 4(b).
For multi-person pose estimation, the invention has higher precision. Table 3 compares this example with other models typically used in known methods, such as Mask RCNN, RMPE, HR-Net and OpenPose. AP (average precision) is the average accuracy used to compute the precision percentage on the test set; OKS (object keypoint similarity) is the keypoint similarity, computed from a scale-normalized Euclidean distance, and is mainly used in multi-person pose estimation tasks. The specific formula is as follows:
where v denotes a person in the ground truth (GT) and k indexes that person's keypoints; d is the Euclidean distance between the currently detected keypoint and the keypoint with the same id in the GT; the visibility flag equals 1 when the keypoint is unoccluded and labeled, and takes another value when the keypoint is occluded but labeled; S_v is the scale factor of the person in the GT, whose value is the square root of the detection-box area; σ is the normalization factor of the corresponding keypoint; and δ(x) equals 1 if x is true and 0 otherwise. T is a manually set threshold; following Table 3, AP is reported at OKS thresholds 50 and 75, together with the medium-object metric AP^M and the large-object metric AP^L.
TABLE 3
Method | AP | AP oks=50 | AP oks=75 | AP M | AP L |
Mask RCNN | 62.9 | 87.6 | 67.9 | 57.5 | 71.3 |
RMPE | 72.1 | 88.8 | 79.1 | 68.1 | 77.9 |
HR-Net | 75.5 | 92.4 | 83.3 | 71.9 | 81.5 |
OpenPose | 61.8 | 84.9 | 67.5 | 57.1 | 68.2 |
The invention | 76.4 | 90.1 | 84.5 | 72.3 | 83.1 |
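The OKS formula above can be sketched directly; the two-keypoint layout, the visibility flags and the per-keypoint sigmas below are illustrative:

```python
import numpy as np

def oks(pred, gt, vis, area, sigmas):
    """Object Keypoint Similarity: the per-keypoint squared Euclidean distance
    d^2 is mapped to exp(-d^2 / (2 * S^2 * sigma^2)) and averaged over the
    keypoints that are labelled (vis > 0); `area` plays the role of S^2, the
    detection-box area."""
    d2 = np.sum((pred - gt) ** 2, axis=1)
    e = np.exp(-d2 / (2.0 * area * sigmas ** 2))
    mask = vis > 0
    return float(np.sum(e[mask]) / np.sum(mask))

gt = np.array([[0.0, 0.0], [10.0, 10.0]])
pred = gt.copy()               # a perfect prediction
vis = np.array([2, 1])         # 2 = visible, 1 = labelled but occluded
sigmas = np.array([0.05, 0.05])
print(oks(pred, gt, vis, area=100.0, sigmas=sigmas))  # 1.0
```

A detection counts as correct at threshold T when OKS ≥ T, which is how the AP@OKS=50 and AP@OKS=75 columns of Table 3 are computed.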
Step4, first, the accurate pose estimation result F_pose (fig. 4(c)) and the coarse parsing result F_parsing obtained in Step2 (fig. 5(b)) are cascaded, and the labels of different human instances are segmented through the human body multi-region candidate boxes and pose constraints.
Then a semantic spatial distance loss is defined, L_f_parsing = λ_1·L_parsing + L_parsing·L_pose + λ_2·L_pose, to reduce the gap between different semantics, where λ_1 is the parsing loss weight and λ_2 is the pose loss weight; L_parsing denotes the coarse parsing loss and L_pose the pose estimation loss; a denotes the total number of pixels in the image; m denotes the number of label categories; 1 denotes the first-type cross-entropy loss; y_i denotes the category of the i-th pixel; ln(f_m) denotes the log-probability of predicting the m-th class of semantics; n is the total number of joints; and the remaining symbols denote the pixel coordinates and the ground-truth coordinates of the n-th joint in the image, respectively.
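The combined loss is a one-liner; the cross term L_parsing · L_pose is what couples the two branches, so errors in one re-weight the other. The weights λ1, λ2 below are illustrative:

```python
def fused_parsing_loss(l_parsing, l_pose, lam1=1.0, lam2=0.5):
    """Semantic spatial distance loss from Step4:
    L = lam1 * L_parsing + L_parsing * L_pose + lam2 * L_pose."""
    return lam1 * l_parsing + l_parsing * l_pose + lam2 * l_pose

print(round(fused_parsing_loss(0.4, 0.2), 2))  # 0.58 = 0.4 + 0.08 + 0.1
```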
Finally, the cascaded result is mapped through convolution to size N×C and the coarse parsing result to size C×N; the two are fused and input into 3 convolution layers of size 7×7 with 128 channels for refinement, yielding the final accurate parsing result.
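One plausible reading of the N×C / C×N fusion is a matrix product between the two mapped results (a non-local-style pairing of the N spatial positions through the C channels); the sizes below are illustrative and this is a sketch, not the patent's exact operation:

```python
import numpy as np

rng = np.random.default_rng(3)
N, C = 6, 4                        # illustrative sizes
cascade = rng.normal(size=(N, C))  # cascaded pose+parsing result, mapped to N x C
coarse = rng.normal(size=(C, N))   # coarse parsing result, mapped to C x N
fused = cascade @ coarse           # N x N affinity-style fusion of the two mappings
print(fused.shape)  # (6, 6): fed to the 7x7, 128-channel refinement convolutions
```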
As shown in fig. 6, the final accurate parsing result (c) is obtained after Step4. The spatial semantic distance loss function is introduced for the third-stage training, using the coarse parsing result (b) and the pose estimation result (a) obtained from the first two training stages. Table 4 compares this example with other typical instance-level human parsing models among known methods, such as NAN, M-CE2P and RP-R-CNN, on the same data set; it can be seen that the indexes of the invention are superior to the other known methods. Here mIOU is the ratio of the intersection to the union of the ground-truth and predicted sets, which can be written as TP (the intersection) over the union of TP, FP and FN: mIOU = TP/(FP + FN + TP). PCP_50 is the percentage of correct parts: a predicted label position is considered correctly detected if its distance to the ground truth is less than half the part length (usually expressed as PCP@0.5). AP^P is the part-based average precision.
TABLE 4
While the present invention has been described in detail with reference to the embodiments shown in the drawings, it is not limited to these embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the invention.
Claims (4)
1. An accurate human body analysis method for crowded people, characterized in that the method comprises the following steps:
Step1, inputting a crowded-crowd image set, extracting coarse hierarchical image features and superpixel features through a deep residual network, and performing feature representation on the human body images to obtain a foreground accurate semantic feature map and generate human body candidate multi-region detection boxes;
Step2, sampling the foreground accurate semantic feature maps to the same size and fusing them to generate high-resolution features, and obtaining a coarse human body parsing result through preliminary parsing;
Step3, performing human body pose estimation on the foreground accurate semantic features inside the candidate multi-region detection boxes to generate human body joint points, and refining them to obtain an accurate multi-person pose estimation result;
Step4, jointly optimizing the obtained coarse human body parsing result and the accurate multi-person pose estimation result by computing the semantic distance loss, and outputting the final accurate human body parsing result.
2. The accurate human body analysis method for crowded people according to claim 1, wherein the specific process of Step1 is as follows:
first, hierarchical features P = {P_1, P_2, P_3, P_4, P_5} are extracted from the input crowded-crowd image set G using ResNet-101, and a superpixel partition series S = {S_0, S_1, ..., S_N} is generated using COB convolution-oriented boundaries, where S_N is a single superpixel representing the entire image, formed by combining two superpixels of S_{N-1}; to match the sizes of P_2, P_4 and P_5, a subset N = {N_2, N_4, N_5} is selected from S, where the number of nodes between adjacent levels differs by a factor of 1/4;
then, the feature mapping W[Δmin(P_l^e) ‖ Δmax(P_l^e)] is performed on P_2, P_4 and P_5 to map them onto a graph matrix, where W is the learnable weight matrix of a fully connected layer, ‖ denotes the concatenation operation, Δmin(P_l^e) and Δmax(P_l^e) denote minimum pooling and maximum pooling respectively, and P_l^e denotes the grid cells of the corresponding hierarchical superpixel partition;
then, the context and hierarchical information of the mapped features is extracted through a graph neural network and fused with the combined feature-pyramid decoding features; to reduce redundant computation, spatial and channel attention are added to the graph neural network to obtain the final feature representation result; given a mapping node i and its neighboring node set C_i, the spatial attention of node i is computed with M self-attention heads over the sum of the feature-vector sets collected from the neighbor nodes of node i; the channel attention takes the mean of the feature vectors of node i and its neighbors, applies a fully connected layer σ activated by Sigmoid, and performs element-wise multiplication; the final attention is fused through a scale weight β initialized to 0;
finally, the foreground accurate semantic features F_f = {P'_u | 1 ≤ u ≤ 5} are obtained based on the feature representation result; the feature representation result is input into a hierarchical cascade RPN to obtain candidate regions, and human body candidate multi-region detection boxes D_v are generated through classification and regression prediction, where v denotes the number of people in the image.
3. The accurate human body analysis method for crowded people according to claim 1, wherein the specific process of Step3 is as follows:
firstly, affine transformation is performed on all the human body candidate multi-region detection boxes D detected in Step1;
then, each transformed human body is input into the single-pose estimation module to generate joint heat maps, and two joint types are defined, namely target joints and interference joints; a loss based on RMSE, the root-mean-square error function, is defined to suppress the interference joints, where k denotes the k-th joint of the v-th person and K is the total number of joints;
then, a human body pose association rule is defined, all generated joint points are associatively connected, and a skeleton labeling set A = {A_L} is generated, where L denotes the number of generated connections; a pose similarity function f(A_x, A_y, η) = 1[d(A_x, A_y | Λ, λ) ≤ η] is defined to eliminate similar poses, where d(·) is a defined pose distance function, Λ, λ are parameter sets, and η is a distance threshold; if d ≤ η, A_x is judged redundant and eliminated; specifically, d(A_x, A_y | Λ) = Q(A_x, A_y | η_1) + μH(A_x, A_y | η_2), where the function Q computes the confidence of matched joint points between pose skeletons to obtain the number of matched joints between poses, the function H computes the spatial distance between joints, η_1 and η_2 are tunable parameters, μ is a weight balancing the two distances, and Λ = {η_1, η_2, μ};
finally, the original image coordinates are restored through inverse affine transformation, and the accurate pose estimation result F_pose is generated through refinement.
4. The accurate human body analysis method for crowded people according to claim 1, wherein the specific process of Step4 is as follows:
firstly, the accurate pose estimation result F_pose and the coarse parsing result F_parsing obtained in Step2 are cascaded, and the labels of different human instances are segmented through the human body multi-region candidate boxes and pose constraints;
then, a semantic spatial distance loss L_f_parsing = λ_1·L_parsing + L_parsing·L_pose + λ_2·L_pose is defined to reduce the gap between different semantics, where λ_1 is the parsing loss weight, λ_2 is the pose loss weight, L_parsing denotes the coarse parsing loss, L_pose denotes the pose estimation loss, a denotes the total number of pixels in the image, m denotes the number of label categories, 1 denotes the first-type cross-entropy loss, y_i denotes the category of the i-th pixel, ln(f_m) denotes the log-probability of predicting the m-th class of semantics, n is the total number of joint points, and the remaining symbols denote the pixel coordinates and the ground-truth coordinates of the n-th joint in the image, respectively;
finally, the cascaded result is mapped through convolution to size N×C and the parsing result to size C×N; the two are fused and input into 3 convolution layers of size 7×7 with 128 channels for refinement to obtain the final accurate parsing result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111645897.XA CN114973305B (en) | 2021-12-30 | 2021-12-30 | Accurate human body analysis method for crowded people |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114973305A true CN114973305A (en) | 2022-08-30 |
CN114973305B CN114973305B (en) | 2023-03-28 |
Family
ID=82975210
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111645897.XA Active CN114973305B (en) | 2021-12-30 | 2021-12-30 | Accurate human body analysis method for crowded people |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114973305B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115205906A (en) * | 2022-09-15 | 2022-10-18 | 山东能源数智云科技有限公司 | Method, device and medium for detecting warehousing operation personnel based on human body analysis |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102194105A (en) * | 2010-03-19 | 2011-09-21 | 微软公司 | Proxy training data for human body tracking |
CN103164858A (en) * | 2013-03-20 | 2013-06-19 | 浙江大学 | Adhered crowd segmenting and tracking methods based on superpixel and graph model |
EP3016027A2 (en) * | 2014-10-30 | 2016-05-04 | Panasonic Intellectual Property Management Co., Ltd. | Human body part detection system and human body part detection method |
CN108564012A (en) * | 2018-03-29 | 2018-09-21 | 北京工业大学 | A kind of pedestrian's analytic method based on characteristics of human body's distribution |
US20180300540A1 (en) * | 2017-04-14 | 2018-10-18 | Koninklijke Philips N.V. | Person identification systems and methods |
CN111062274A (en) * | 2019-12-02 | 2020-04-24 | 汇纳科技股份有限公司 | Context-aware embedded crowd counting method, system, medium, and electronic device |
CN111339903A (en) * | 2020-02-21 | 2020-06-26 | 河北工业大学 | Multi-person human body posture estimation method |
US20210082136A1 (en) * | 2018-12-04 | 2021-03-18 | Yoti Holding Limited | Extracting information from images |
CN113111848A (en) * | 2021-04-29 | 2021-07-13 | 东南大学 | Human body image analysis method based on multi-scale features |
CN113592893A (en) * | 2021-08-29 | 2021-11-02 | 浙江工业大学 | Image foreground segmentation method combining determined main body and refined edge |
CN113673327A (en) * | 2021-07-14 | 2021-11-19 | 南京邮电大学 | Penalty ball hit prediction method based on human body posture estimation |
CN113723255A (en) * | 2021-08-24 | 2021-11-30 | 中国地质大学(武汉) | Hyperspectral image classification method and storage medium |
Non-Patent Citations (5)
Title |
---|
JIAN ZHAO, JIANSHU LI: "Understanding Humans in Crowded Scenes: Deep Nested Adversarial Learning and A New Benchmark for Multi-Human Parsing", Proceedings of the 26th ACM International Conference on Multimedia *
JIANSHU LI: "Multi-Human Parsing in the Wild", CSDN *
THOMAS GOLDA: "Human Pose Estimation for Real-World Crowded Scenarios", Tencent Cloud Developer Community *
GAN LIN: "Research on a Dressed Human Body Parsing Method for Two-Dimensional Virtual Try-On", Kunming University of Science and Technology *
GAN LIN; LIU LI; LIU LIJUN; FU XIAODONG; HUANG QINGSONG: "An Accurate Human Body Parsing Model Combining Edge Contours and Pose Features", Journal of Computer-Aided Design & Computer Graphics *
Also Published As
Publication number | Publication date |
---|---|
CN114973305B (en) | 2023-03-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111339903B (en) | Multi-person human body posture estimation method | |
CN111476181B (en) | Human skeleton action recognition method | |
CN108154118B (en) | A kind of target detection system and method based on adaptive combined filter and multistage detection | |
CN107609460B (en) | Human body behavior recognition method integrating space-time dual network flow and attention mechanism | |
CN111310659B (en) | Human body action recognition method based on enhanced graph convolution neural network | |
WO2020108362A1 (en) | Body posture detection method, apparatus and device, and storage medium | |
Yang et al. | Extraction of 2d motion trajectories and its application to hand gesture recognition | |
Zhang et al. | Deep hierarchical guidance and regularization learning for end-to-end depth estimation | |
Asif et al. | A multi-modal, discriminative and spatially invariant CNN for RGB-D object labeling | |
CN113408455B (en) | Action identification method, system and storage medium based on multi-stream information enhanced graph convolution network | |
CN108960184B (en) | Pedestrian re-identification method based on heterogeneous component deep neural network | |
CN112200111A (en) | Global and local feature fused occlusion robust pedestrian re-identification method | |
CN112307995B (en) | Semi-supervised pedestrian re-identification method based on feature decoupling learning | |
CN112347861B (en) | Human body posture estimation method based on motion feature constraint | |
Sincan et al. | Using motion history images with 3d convolutional networks in isolated sign language recognition | |
CN111310668B (en) | Gait recognition method based on skeleton information | |
Li et al. | Effective person re-identification by self-attention model guided feature learning | |
Zhou et al. | Face parsing via a fully-convolutional continuous CRF neural network | |
CN113052017B (en) | Unsupervised pedestrian re-identification method based on multi-granularity feature representation and domain self-adaptive learning | |
CN116596966A (en) | Segmentation and tracking method based on attention and feature fusion | |
Rani et al. | An effectual classical dance pose estimation and classification system employing convolution neural network–long shortterm memory (CNN-LSTM) network for video sequences | |
CN114973305B (en) | Accurate human body analysis method for crowded people | |
CN114187653A (en) | Behavior identification method based on multi-stream fusion graph convolution network | |
Liu et al. | Bayesian inferred self-attentive aggregation for multi-shot person re-identification | |
Lei et al. | Continuous action recognition based on hybrid CNN-LDCRF model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |