CN114973305A - Accurate human body analysis method for crowded people - Google Patents

Accurate human body analysis method for crowded people

Info

Publication number
CN114973305A
CN114973305A (application CN202111645897.XA)
Authority
CN
China
Prior art keywords: human body, accurate, joint, image, result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111645897.XA
Other languages
Chinese (zh)
Other versions
CN114973305B (en)
Inventor
刘骊
韦勇
付晓东
黄青松
刘利军
Current Assignee
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202111645897.XA
Publication of CN114973305A
Application granted
Publication of CN114973305B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to an accurate human body analysis method for crowds, and belongs to the field of computer vision and image application. First, a crowded crowd image set is input, coarse hierarchical image features and superpixel features are extracted through a deep residual network, and feature representation of the human body images yields foreground accurate semantic feature maps, from which human body candidate multi-region detection frames are generated. Second, the foreground accurate semantic feature maps are sampled to the same size and fused to generate high-resolution features, and a human body coarse analysis result is obtained through preliminary analysis. Then, human body posture estimation is performed on the foreground accurate semantic features in the candidate multi-region detection frames to generate human body joint points, which are refined into a multi-person accurate posture estimation result. Finally, the human body coarse analysis result and the multi-person accurate posture estimation result are jointly optimized by computing a semantic distance loss, and the final accurate human body analysis result is output. The invention can effectively analyze human body images in crowds.

Description

Accurate human body analysis method for crowded people
Technical Field
The invention relates to an accurate human body analysis method for crowds, and belongs to the field of computer vision and image application.
Background
Human body analysis is a fine-grained semantic segmentation task that aims to identify the components of a human body image at the pixel level, such as body parts and clothes. It is a basic task in multimedia and computer vision, with good potential in various visual scenarios such as behavior analysis, video and image understanding, and intelligent security. Known methods account for semantic features of different sizes by using, for example, FCN, DeeplabV1, DeeplabV3, SegNet and ASPP structures, aiming to improve human body analysis by extracting multi-scale semantic features. However, multi-scale information alone does not capture the deep relationships between pixels, and the complex interactions between human instances in crowded scenes cannot be modeled well. Technically, several key problems of human body analysis in crowds remain unsolved, mainly in three aspects: 1) the background is complex, and its color can be too similar to a person's clothes; 2) the number of human instances varies greatly, their motion postures are diverse, and people in complex motion environments interact strongly, making feature attribution difficult; 3) crowded environments contain complex occlusions, including self-occlusion of people, occlusion between people and things, and mutual occlusion between human instances, all of which strongly affect the accuracy of human body analysis. These three aspects are the key problems that human body analysis in crowds urgently needs to solve.
Known human body analysis methods are mainly based on feature enhancement, multi-task learning and the like. For example, Zhang X (Neurocomputing, 402, 2020, 375-383) proposed a Semantic Spatial Fusion Network (SSFNet) for human body parsing that narrows the semantic gap and gives accurate high-resolution predictions by aggregating multi-resolution features. Zhang Z (IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, 8897-). However, although these known methods supplement human body analysis with multi-scale semantics and other tasks, they achieve good results mainly on single-person analysis; they can be extended to multi-person cases by combining a target detection algorithm, but then depend too heavily on the accuracy of the target detection method, ignore the relationships between different human instances, and struggle to perform well in crowds. Patent CN113111848A adds a dilated convolution to each feature layer between the encoder and decoder for multi-scale feature fusion, enhancing the feature extraction capability of the model and addressing the insufficient pixel precision of traditional human body analysis methods at human body edges. However, that is only a simple stacked dilated-convolution structure with a large amount of redundant computation, and solving edge precision alone does not suit the human body analysis task well. The present method does not merely add a dilated convolution layer at the last feature level; it adds superpixel features characterizing the internal image structure in the middle of the codec, so that a preliminary human body structure is obtained together with precise edges. Patent CN113537072A shares multi-scale features extracted from the backbone and, after non-local processing, performs the posture estimation and human body analysis tasks through joint learning. Although it considers the common points of the posture and parsing tasks, it ignores the disparity between the two tasks, and the method only applies to the single-person parsing case.
Disclosure of Invention
The invention provides an accurate human body analysis method for crowded people, which effectively analyzes crowded crowd images to obtain an accurate human body analysis result, thereby meeting current accuracy requirements for crowded crowd analysis.
The technical scheme of the invention is as follows: a method for accurate human body analysis of crowds. First, a crowded crowd image set is input, coarse hierarchical image features and superpixel features are extracted through a deep residual network, and feature representation of the human body images yields foreground accurate semantic feature maps and generates human body candidate multi-region detection frames. Second, the foreground accurate semantic feature maps are sampled to the same size and fused to generate high-resolution features, and a human body coarse analysis result is obtained through preliminary analysis. Then, human body posture estimation is performed on the foreground accurate semantic features in the candidate multi-region detection frames to generate human body joint points, which are refined into a multi-person accurate posture estimation result. Finally, the human body coarse analysis result and the multi-person accurate posture estimation result are jointly optimized by computing a semantic distance loss, and the final accurate human body analysis result is output.
The method comprises the following specific steps:
Step1, input a crowded crowd image set G = {G_1, G_2, ..., G_n}, extract coarse hierarchical image features and superpixel features through a deep residual network, perform feature representation on the human body images to obtain foreground accurate semantic feature maps, and generate human body candidate multi-region detection frames;
Step2, sample the foreground accurate semantic feature maps of different scales to the same size by bilinear interpolation, fuse them to generate high-resolution features, and obtain a human body coarse analysis result through preliminary analysis;
Step3, carrying out human body posture estimation on the foreground accurate semantic features in the candidate multi-region detection frame, defining a joint loss function to inhibit interference joints, generating human body joint points, defining a human body posture association rule, carrying out association connection on all the generated joint points, and refining to obtain a multi-person accurate posture estimation result;
and Step4, performing joint optimization on the obtained human body coarse analysis result and the multi-person accurate posture estimation result by calculating the semantic distance loss, and outputting a final accurate human body analysis result.
Step1 is concretely as follows:
First, hierarchical features P = {P_1, P_2, P_3, P_4, P_5} are extracted from the input crowd image set G using ResNet-101, and a series of superpixel partitions S = {S_0, S_1, ..., S_N} is generated using convolution-oriented boundaries (COB), where S_N is the superpixel representing the entire image, obtained by merging the superpixels of S_{N-1}. Matching the sizes of P_2, P_4 and P_5, a subset N = {N_2, N_4, N_5} is selected in S, with the number of nodes between adjacent levels differing by a factor of 1/4.
Then, feature mapping is performed on P_2, P_4 and P_5 to map them onto a graph matrix:

h_l = W·[Δ_min(P_l^e) || Δ_max(P_l^e)]

where W is the learnable weight matrix of the fully connected layer, || denotes the concatenation operation, Δ_min(P_l^e) and Δ_max(P_l^e) denote minimum pooling and maximum pooling respectively, and P_l^e denotes the grid cells jointly corresponding to hierarchical superpixel blocks.
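The cell-to-node mapping described above (min/max pooling over each superpixel grid cell, concatenation, then a learned linear map) can be sketched as follows; the function name and all shapes are illustrative assumptions, not the patent's implementation:

```python
import numpy as np

def map_cell_features(cell_feats: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Map one superpixel grid cell P_l^e to a graph-node vector:
    W . [min-pool || max-pool]. Shapes are illustrative assumptions."""
    d_min = cell_feats.min(axis=0)           # minimum pooling over the cell
    d_max = cell_feats.max(axis=0)           # maximum pooling over the cell
    concat = np.concatenate([d_min, d_max])  # the || concatenation
    return concat @ W                        # learnable fully connected layer W

rng = np.random.default_rng(0)
cell = rng.normal(size=(16, 8))   # 16 pixels inside the cell, 8-dim features
W = rng.normal(size=(16, 4))      # (2*8) -> 4-dim graph-node vector
node_vec = map_cell_features(cell, W)
print(node_vec.shape)             # (4,)
```

The pooled pair keeps both the weakest and strongest responses inside a cell, which is one plausible reason the patent concatenates both rather than averaging.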
Context and hierarchical information of the mapped features are then extracted through the graph neural network and fused with the feature-pyramid decoding features; to reduce redundant computation, spatial and channel attention are added in the graph neural network, giving the final feature representation result. Given a mapped node i and its set of neighboring nodes C_i, the spatial attention h_i^sp of node i is computed with M self-attention heads from Σ_{j∈C_i} h_j, the sum of the feature vectors collected from the neighbor nodes of node i. The channel attention is expressed as

h_i^ch = σ(h̄_i) ⊙ h_i

where h̄_i denotes the mean of the feature vectors of node i and its neighbors, σ denotes a Sigmoid-activated fully connected layer, and ⊙ denotes element-wise multiplication. The final attention output is

h_i' = h_i + β·(h_i^sp + h_i^ch)

where β is a scale weight initialized to 0.
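The channel-attention step (a Sigmoid-activated fully connected layer applied to the mean feature vector of a node and its neighbors, combined element-wise, then blended back with the scale weight β initialized to 0) can be illustrated with a minimal NumPy sketch; the exact gating form and shapes are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(h_i, neighbors, Wc, beta=0.0):
    """Sigmoid-gated FC layer on the mean of node i and its neighbors,
    applied to h_i element-wise, blended with scale weight beta (assumed form)."""
    mean_vec = np.vstack([h_i[None, :], neighbors]).mean(axis=0)  # mean vector
    gate = sigmoid(mean_vec @ Wc)    # Sigmoid-activated fully connected layer
    attended = gate * h_i            # element-wise multiplication
    return h_i + beta * attended     # beta initialized to 0 -> identity at start

h = np.array([1.0, -2.0, 3.0])
nbrs = np.array([[0.5, 0.5, 0.5], [1.5, -1.0, 2.0]])
out0 = channel_attention(h, nbrs, np.eye(3), beta=0.0)
```

Initializing β to 0 makes the attention branch a no-op at the start of training, so the graph network first learns without attention and gradually mixes it in.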
Finally, based on the feature representation result, the foreground accurate semantic features F_f = {P'_u | 1 ≤ u ≤ 5} are obtained. The feature representation result is input into the hierarchical cascade RPN to obtain candidate regions, and the human body candidate multi-region detection frames D = {D_v} are generated through classification and regression prediction, where v denotes the number of people in the image.
Step3 is concretely as follows:
First, affine transformation is performed on all the human body candidate multi-region detection frames detected in Step1.
Then, each transformed human body is input into a single-person posture estimation module to generate joint heatmaps H_k^v, and two joint types are defined: target joints T_k^v and interference joints I_k^v, which together account for the heatmap responses. A loss L_joint, based on the root mean square error (RMSE) between the predicted heatmaps and the target-joint heatmaps, is defined to suppress the interference joints, where k denotes the kth joint of the vth person and K is the total number of joints.
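The suppression loss appears in the patent only as a formula image, so the sketch below is one hypothetical formulation: an RMSE fit to the target heatmaps plus a penalty, with an assumed weight mu, on any response left at interference-joint locations:

```python
import numpy as np

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

def joint_loss(pred, target, interference, mu=0.5):
    """Hypothetical interference-suppression loss: RMSE fit to the target
    heatmaps plus a penalty (weight mu, assumed) on response remaining at
    interference-joint locations. pred/target/interference: (K, H, W)."""
    K = pred.shape[0]
    fit = sum(rmse(pred[k], target[k]) for k in range(K))
    leak = sum(rmse(pred[k] * interference[k], np.zeros_like(pred[k]))
               for k in range(K))
    return fit + mu * leak

target = np.zeros((3, 4, 4)); target[:, 1, 1] = 1.0
interference = np.zeros((3, 4, 4))
loss = joint_loss(target, target, interference)
```

A perfect prediction with no interference response gives zero loss; residual heat at interference joints raises the loss in proportion to mu.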
Then, a human body posture association rule is defined, all generated joint points are connected by association, and a skeleton labeling graph A = {A_l | 1 ≤ l ≤ L} is generated, where L denotes the number of generated connections. A posture similarity function f(A_x, A_y, η) = 1[d(A_x, A_y | Λ, λ) ≤ η] is defined to eliminate similar postures, where d(·) is the defined posture distance function, Λ and λ are parameter sets, and η is a distance threshold. If d ≤ η, A_x is judged redundant and eliminated. Concretely, d(A_x, A_y | Λ) = Q(A_x, A_y, η_1) + μ·H(A_x, A_y, η_2), where the function Q computes the confidence of the matched joint points between posture skeletons to obtain the number of matched joints between the postures, the function H computes the spatial distance between joints, η_1 and η_2 are tunable parameters inside the functions, μ is a weight balancing the two distances, and Λ = {η_1, η_2, μ}.
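The posture distance d(A_x, A_y | Λ) = Q + μ·H and the redundancy test f = 1[d ≤ η] can be sketched as follows; the skeleton representation (joint id mapped to (x, y, confidence)) and the concrete forms of Q and H are assumptions:

```python
def match_count(ax, ay, eta1):
    """Q: number of matched joints; here, joints present in both skeletons whose
    confidences both exceed eta1 (one plausible reading of the patent's Q)."""
    return sum(1 for k in ax if k in ay and ax[k][2] > eta1 and ay[k][2] > eta1)

def spatial_distance(ax, ay, eta2):
    """H: summed Euclidean distance between corresponding joints, scaled by eta2."""
    return sum(((ax[k][0] - ay[k][0]) ** 2 + (ax[k][1] - ay[k][1]) ** 2) ** 0.5
               for k in ax if k in ay) / eta2

def pose_dist(ax, ay, eta1=0.5, eta2=10.0, mu=1.0):
    """d(A_x, A_y | Lambda) = Q(A_x, A_y, eta1) + mu * H(A_x, A_y, eta2)."""
    return match_count(ax, ay, eta1) + mu * spatial_distance(ax, ay, eta2)

def is_redundant(ax, ay, eta=5.0):
    """f(A_x, A_y, eta) = 1[d <= eta]: A_x is eliminated when this holds."""
    return pose_dist(ax, ay) <= eta

pose = {0: (1.0, 1.0, 0.9), 1: (2.0, 2.0, 0.9), 2: (3.0, 3.0, 0.9)}
d_same = pose_dist(pose, pose)   # identical skeletons: Q = 3 matched, H = 0
```

With identical skeletons the spatial term vanishes and d reduces to the matched-joint count, so the elimination threshold η effectively trades off joint overlap against spatial separation.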
Finally, the original image coordinates are restored through inverse affine transformation, and the accurate posture estimation result F_pose is generated by refinement.
The specific process of Step4 is as follows:
First, the accurate posture estimation result F_pose and the coarse analysis result F_parsing obtained in Step2 are cascaded, and the labels of different human body instances are segmented through the human body multi-region candidate frames and posture constraints.
Then a semantic spatial distance loss L_f_parsing = λ_1·L_parsing + L_parsing·L_pose + λ_2·L_pose is defined to reduce the gap between different semantics, where λ_1 is the parsing loss weight and λ_2 is the posture loss weight. The coarse parsing loss is

L_parsing = -Σ_{i=1}^{a} Σ_{m=1}^{M} 1(y_i = m)·ln(f_m)

and the posture estimation loss is

L_pose = Σ_{j=1}^{n} ||z_j - ẑ_j||

where a denotes the total number of pixels in the image; M denotes the number of label categories; 1(·) denotes the indicator of the cross-entropy loss; y_i denotes the category of the ith pixel; ln(f_m) denotes the log-probability of predicting the mth semantic class; n is the total number of joint points; and z_j and ẑ_j denote the predicted pixel coordinates and the true coordinates of the jth joint in the image, respectively.
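The combined loss L_f_parsing = λ_1·L_parsing + L_parsing·L_pose + λ_2·L_pose, with a cross-entropy parsing term and a joint-coordinate posture term, can be sketched numerically as follows; array layouts are assumed:

```python
import numpy as np

def parsing_loss(probs, labels):
    """Cross-entropy over a pixels: -sum_i ln f_{y_i}.
    probs: (a, M) predicted class probabilities; labels: (a,) class indices."""
    a = labels.shape[0]
    return float(-np.sum(np.log(probs[np.arange(a), labels])))

def pose_loss(pred_xy, true_xy):
    """Summed Euclidean distance between predicted and true joint coordinates
    (a squared-error variant would also fit the description)."""
    return float(np.sum(np.linalg.norm(pred_xy - true_xy, axis=1)))

def semantic_distance_loss(probs, labels, pred_xy, true_xy, lam1=1.0, lam2=1.0):
    """L_f_parsing = lam1*L_parsing + L_parsing*L_pose + lam2*L_pose."""
    lp = parsing_loss(probs, labels)
    lq = pose_loss(pred_xy, true_xy)
    return lam1 * lp + lp * lq + lam2 * lq

labels = np.array([0, 1])
perfect = np.array([[1.0, 0.0], [0.0, 1.0]])
joints = np.array([[1.0, 2.0], [3.0, 4.0]])
total = semantic_distance_loss(perfect, labels, joints, joints)
```

The product term L_parsing·L_pose couples the two tasks: either task predicting perfectly zeroes the cross term, while joint errors in both are amplified.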
Finally, the cascaded result is mapped to size N×C through convolution, the analysis result is mapped to size C×N, the two are fused, and the fusion is input into three 7×7 convolution layers with 128 channels for refinement, giving the final accurate analysis result.
The invention has the beneficial effects that:
1. In multi-scale feature learning, known methods perform spatial and scale feature interaction only with fixed-topology neural networks, ignoring the internal structure of the image, so basic feature information is lost or weakened during propagation and interaction. The present method maps superpixels to graph nodes to inherit the inherent hierarchical structure of the image and fuses different levels with intermediate-layer feature subsets, obtaining enhanced multi-scale features while better integrating the inherent image structure. This provides finer-grained feature representation for the subsequent target detection and segmentation tasks and improves the accuracy of subsequent human body analysis.
2. Images of crowded scenes suffer from complex backgrounds and postures, large variation in the number of human instances, mutual occlusion, and the like. Most known human body analysis methods only consider pixel precision at multiple scales to improve analysis accuracy, so the models degrade severely in crowded scenes; without considering complex postures and occluded environments, high precision is hard to obtain. The present invention performs top-down multi-person posture estimation, segmenting different human bodies through accurate target detection; in addition, by modeling occlusion and complex postures and combining them with the bottom-up human body semantic information obtained from human body analysis, a semantic spatial distance loss is designed, so crowded crowd images are analyzed accurately with higher precision.
3. The invention extracts common features through the improved residual network, performs human body analysis and posture estimation separately, and maps the posture features into the human body analysis as a supplement, effectively addressing the problems of crowded human body analysis. Superpixel features are added in the feature representation, graph convolution establishes the relationships between pixels to obtain accurate foreground features, and crowd posture estimation and coarse human body analysis are performed from the same foreground features. The accurate joint-point features of crowded postures and the human body part features in the analysis complement each other, and a joint optimization module is designed to obtain an accurate human body analysis result with high precision.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a detailed flow chart of the feature representation of the present invention.
FIG. 3 is a flowchart illustrating a rough analysis method according to the present invention.
FIG. 4 is a diagram of an example of an accurate pose estimation according to the present invention.
FIG. 5 is a diagram of an example of an accurate analysis result according to the present invention.
FIG. 6 is a flowchart illustrating an embodiment of the present invention.
Detailed Description
Example 1: as shown in fig. 1 to 6, a method for accurate human body analysis for crowded people includes the following steps:
step1, inputting a crowded crowd image set, extracting a coarse image layering characteristic and a super-pixel characteristic through a depth residual error network, performing characteristic representation on a human body image to obtain a foreground accurate semantic characteristic diagram, and generating a human body candidate multi-region detection frame;
step2, sampling the foreground accurate semantic feature map to the same size, fusing the foreground accurate semantic feature map and the foreground accurate semantic feature map to generate high-resolution features, and obtaining a human body coarse analysis result through preliminary analysis;
step3, carrying out human body posture estimation on the foreground accurate semantic features in the candidate multi-region detection frame to generate human body joint points, and refining to obtain a multi-person accurate posture estimation result;
and Step4, performing joint optimization on the obtained human body coarse analysis result and the multi-person accurate posture estimation result by calculating the semantic distance loss, and outputting a final accurate human body analysis result.
Further, Step1 may be specifically set as follows:
First, hierarchical features P = {P_1, P_2, P_3, P_4, P_5} are extracted from the input crowded crowd image set G using ResNet-101, and a series of superpixel partitions S = {S_0, S_1, ..., S_N} is generated using convolution-oriented boundaries (COB), where S_N is the superpixel representing the entire image, obtained by merging the superpixels of S_{N-1}. Matching the sizes of P_2, P_4 and P_5, a subset N = {N_2, N_4, N_5} is selected in S, with the number of nodes between adjacent levels differing by a factor of 1/4.
Then, feature mapping is performed on P_2, P_4 and P_5 to map them onto a graph matrix:

h_l = W·[Δ_min(P_l^e) || Δ_max(P_l^e)]

where W is the learnable weight matrix of the fully connected layer, || denotes the concatenation operation, Δ_min(P_l^e) and Δ_max(P_l^e) denote minimum pooling and maximum pooling respectively, and P_l^e denotes the grid cells jointly corresponding to hierarchical superpixel blocks.
Context and hierarchical information of the mapped features are then extracted through the graph neural network and fused with the feature-pyramid decoding features; to reduce redundant computation, spatial and channel attention are added in the graph neural network, giving the final feature representation result. Given a mapped node i and its set of neighboring nodes C_i, the spatial attention h_i^sp of node i is computed with M self-attention heads from Σ_{j∈C_i} h_j, the sum of the feature vectors collected from the neighbor nodes of node i. The channel attention is expressed as

h_i^ch = σ(h̄_i) ⊙ h_i

where h̄_i denotes the mean of the feature vectors of node i and its neighbors, σ denotes a Sigmoid-activated fully connected layer, and ⊙ denotes element-wise multiplication. The final attention output is

h_i' = h_i + β·(h_i^sp + h_i^ch)

where β is a scale weight initialized to 0.
Finally, based on the feature representation result, the foreground accurate semantic features F_f = {P'_u | 1 ≤ u ≤ 5} are obtained. The feature representation result is input into the hierarchical cascade RPN to obtain candidate regions, and the human body candidate multi-region detection frames D = {D_v} are generated through classification and regression prediction, where v denotes the number of people in the image.
Further, Step3 may be specifically set as follows:
First, affine transformation is performed on all the human body candidate multi-region detection frames D detected in Step1.
Then, each transformed human body is input into a single-person posture estimation module to generate joint heatmaps H_k^v, and two joint types are defined: target joints T_k^v and interference joints I_k^v, which together account for the heatmap responses. A loss L_joint, based on the root mean square error (RMSE) between the predicted heatmaps and the target-joint heatmaps, is defined to suppress the interference joints, where k denotes the kth joint of the vth person and K is the total number of joints.
Then defining human body posture association rule, and closing all the generated joint pointsConnecting to generate skeleton labeled graph A ═ { A ═ A L L denotes the number of generated connections. And defining a pose similarity function f (A) x ,A y ,η)=1[d(A x ,A y |Λ,λ)≤η]To eliminate similar poses, where d (-) is a defined pose distance function, Λ, λ is a parameter set, and η is a distance threshold. If d ≦ η, determine A x For redundancy, elimination is performed. d (-) is in particular d (A) x ,A y |Λ)=Q(A x ,A y1 )+μH(A x ,A y2 ) And calculating the confidence coefficient of the matched joint points between the posture skeletons by the function Q to obtain the number of the matched joints between the postures. The function H calculates the spatial distance, η, between the joints 12 Is an adjustable parameter in the function, mu is a weight for balancing two distances, Λ ═ η 12 ,μ}。
Finally, the original image coordinates are restored through inverse affine transformation, and the accurate posture estimation result F_pose is generated by refinement.
Further, Step4 may be specifically as follows:
First, the accurate posture estimation result F_pose and the coarse analysis result F_parsing obtained in Step2 are cascaded, and the labels of different human body instances are segmented through the human body multi-region candidate frames and posture constraints.
Then a semantic spatial distance loss L_f_parsing = λ_1·L_parsing + L_parsing·L_pose + λ_2·L_pose is defined to reduce the gap between different semantics, where λ_1 is the parsing loss weight and λ_2 is the posture loss weight. The coarse parsing loss is

L_parsing = -Σ_{i=1}^{a} Σ_{m=1}^{M} 1(y_i = m)·ln(f_m)

and the posture estimation loss is

L_pose = Σ_{j=1}^{n} ||z_j - ẑ_j||

where a denotes the total number of pixels in the image; M denotes the number of label categories; 1(·) denotes the indicator of the cross-entropy loss; y_i denotes the category of the ith pixel; ln(f_m) denotes the log-probability of predicting the mth semantic class; n is the total number of joint points; and z_j and ẑ_j denote the predicted pixel coordinates and the true coordinates of the jth joint in the image, respectively.
Finally, the cascaded result is mapped to size N×C through convolution, the analysis result is mapped to size C×N, the two are fused, and the fusion is input into three 7×7 convolution layers with 128 channels for refinement, giving the final accurate analysis result.
Example 2: a precise human body analysis method aiming at crowds comprises the following specific steps:
Step1: as shown in FIG. 2, image (a) is the input original image. Hierarchical features are extracted from the input crowded crowd image (a) using ResNet-101, see FIG. 2(b), and a series of superpixel partitions (e) S = {S_0, S_1, ..., S_N} is generated using convolution-oriented boundaries (COB), where S_N is the superpixel representing the entire image, obtained by merging the superpixels of S_{N-1}. Matching the sizes of P_2, P_4 and P_5, a subset N = {N_2, N_4, N_5} is selected in S, with the number of nodes between adjacent levels differing by a factor of 1/4.
Then feature mapping is performed on P_2, P_4 and P_5. Concretely, P_2, P_4 and P_5 are first mapped onto the l-th rectangular grid, and the grid cells are then assigned to the superpixel grid-cell set P_l^e, where each grid corresponds to a small rectangular area of the input image (a). The concrete formula is

h_l = W·[Δ_min(P_l^e) || Δ_max(P_l^e)]

where W is the learnable weight matrix of the fully connected layer, || denotes the concatenation operation, and Δ_min(P_l^e) and Δ_max(P_l^e) denote minimum pooling and maximum pooling, respectively.
Context and hierarchical information of the mapped features are then extracted through the graph neural network, see FIG. 2(d). The three layers of the graph neural network extract context information, hierarchical information and context information respectively, each layer having its own learnable parameters that are not shared with the other layers. To reduce redundant computation, spatial and channel attention are added in the graph neural network, and the feature-pyramid decoding features are combined and fused to obtain the final feature representation result, shown in FIG. 2(c). Given a mapped node i and its set of neighboring nodes C_i, the spatial attention h_i^sp of node i is computed with M self-attention heads from Σ_{j∈C_i} h_j, the sum of the feature vectors collected from the neighbor nodes of node i. The channel attention is expressed as h_i^ch = σ(h̄_i) ⊙ h_i, where h̄_i denotes the mean of the feature vectors of node i and its neighbors, σ denotes a Sigmoid-activated fully connected layer, and ⊙ denotes element-wise multiplication. The final attention output is h_i' = h_i + β·(h_i^sp + h_i^ch), where β is a scale weight initialized to 0.
Finally, based on the feature representation result (c), the foreground accurate semantic features F_f = {P'_u | 1 ≤ u ≤ 5} are obtained; the labeled instance graph is shown in (g). The feature representation result is input into the hierarchical cascade RPN to obtain candidate regions, and the human body candidate multi-region detection frames D = {D_v} are generated through classification and regression prediction, where v denotes the number of people in the image, see FIG. 2(f).
The specific flowchart of Step1 is shown in FIG. 2. After Step1, the foreground accurate semantic features (g) and the human body candidate multi-region detection frames (f) are obtained. The data set is selected from general human parsing data sets such as CIHP and Multi-Human Parsing V2.0, which consist mainly of multi-person images, each containing more than 3 people on average, about 60,000 images in total: 43,683 for training, 10,000 for testing and 10,000 for validation. In this example, PyTorch is used to run the experiment with crowded crowd images as input. The model is trained in the first stage, and the training parameters are continuously adjusted so that the model obtains better foreground semantic features and detection frame precision. The first-stage quantitative comparison is shown in Table 1, where this example is compared with other typical deep learning models for target detection among the known methods, such as Faster RCNN, Faster RCNN+FPN and YOLOv3. Params is the number of learnable parameters of the model, GFLOPS is the number of floating point operations, Test Speed is the detection speed, and AP bbox@0.5IOU is the average precision of the detection frames. The results show that although the computation is comparatively large, high accuracy is obtained.
TABLE 1
Method            Params   GFLOPS   Test Speed/ms   AP_bbox@0.5IOU
Faster RCNN       34.6M    172.3    13.9            65.6%
Faster RCNN+FPN   64.1M    240.6    5.1             68.3%
YOLOv3            239M     -        8.12            71.7%
The invention     113M     387.6    25.5            73.1%
Step2, as shown in fig. 3, takes the Step1 foreground accurate semantic features (a) as input. The multi-scale features P′2 and P′4 are upsampled by bilinear interpolation to the scale of P′5, and a 1×1 convolution is applied to each for feature alignment, yielding P″2 and P″4 in the same semantic space; the ASPP structure shown in (c) is used to extract from P′5 its multi-scale feature P″5. P″2, P″4, and P″5 are concatenated and passed through a 1×1 convolution layer to predict all human semantic regions, yielding the coarse parsing result F_parsing = θ(P″2 ‖ P″4 ‖ P″5), where θ denotes a 1×1 convolution and ‖ denotes concatenation, see fig. 3 (b).
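The Step2 fusion described above can be sketched in PyTorch roughly as follows. The module name, channel counts, and the minimal three-branch ASPP are assumptions for illustration, not the patent's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseParsingHead(nn.Module):
    """Sketch of Step2: resample P'2, P'4 to the scale of P'5, align them
    with 1x1 convs, extract multi-scale context from P'5 with dilated
    (ASPP-style) branches, then predict semantic regions with theta."""
    def __init__(self, in_ch=256, num_classes=20):
        super().__init__()
        self.align2 = nn.Conv2d(in_ch, in_ch, kernel_size=1)  # -> P''2
        self.align4 = nn.Conv2d(in_ch, in_ch, kernel_size=1)  # -> P''4
        # Minimal ASPP: parallel dilated 3x3 branches fused by a 1x1 conv
        self.aspp = nn.ModuleList(
            nn.Conv2d(in_ch, in_ch, 3, padding=d, dilation=d) for d in (1, 6, 12)
        )
        self.aspp_fuse = nn.Conv2d(3 * in_ch, in_ch, 1)       # -> P''5
        self.theta = nn.Conv2d(3 * in_ch, num_classes, 1)     # theta: 1x1 conv

    def forward(self, p2, p4, p5):
        size = p5.shape[-2:]
        p2 = self.align2(F.interpolate(p2, size=size, mode="bilinear",
                                       align_corners=False))
        p4 = self.align4(F.interpolate(p4, size=size, mode="bilinear",
                                       align_corners=False))
        p5 = self.aspp_fuse(torch.cat([b(p5) for b in self.aspp], dim=1))
        # Concatenate the aligned features and predict coarse parsing logits
        return self.theta(torch.cat([p2, p4, p5], dim=1))
```

A forward pass with feature maps of pyramid sizes returns per-class logits at the P′5 scale.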
Step3, affine transformation is first performed on all the human body candidate multi-region detection boxes D obtained in Step1: p′ = [β1 β2 β3]·[x, y, 1]ᵀ, where β1, β2, β3 are parameter vectors, and p = (x, y) and p′ are the coordinates before and after the transformation, respectively.
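As a worked illustration, the affine map and its inverse (used later to restore original image coordinates) can be written as follows; representing the three parameter vectors β1, β2, β3 as the columns of a 2×3 matrix is an assumption about the patent's notation:

```python
import numpy as np

def affine_transform(points, beta):
    """Apply p' = [b1 b2 b3] . [x, y, 1]^T to a set of coordinates.
    `beta` is a 2x3 matrix whose columns are the parameter vectors."""
    points = np.asarray(points, dtype=float)                 # (N, 2)
    homog = np.hstack([points, np.ones((len(points), 1))])   # (N, 3)
    return homog @ np.asarray(beta, dtype=float).T           # (N, 2)

def inverse_affine(points, beta):
    """Recover original-image coordinates (Step3's final stage) by
    inverting the affine matrix extended to a 3x3 homogeneous form."""
    A = np.vstack([np.asarray(beta, dtype=float), [0.0, 0.0, 1.0]])
    return affine_transform(points, np.linalg.inv(A)[:2, :])
```

For a pure translation beta = [[1, 0, 2], [0, 1, 3]], the point (0, 0) maps to (2, 3) and the inverse maps it back.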
Then, as shown in Table 2, 18 joint points are defined over the body parts of the human body, and each transformed human body is input to the single-person pose estimation module to generate the joint heat maps H_v = {H_v^k | 1 ≤ k ≤ K}, as shown in fig. 4 (b). Two joint types are then defined: the target joints, which belong to the human instance of the current detection box, and the interference joints, which belong to other instances. A loss function based on the RMSE between the predicted and target joint heat maps is defined to suppress the interfering joints, where RMSE is the root-mean-square-error function, k denotes the k-th joint of the v-th person, and K is the total number of joints. During training, joints in each detection box that do not belong to the current human instance are suppressed by this loss, reducing mis-connections.
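A minimal sketch of such an interference-suppression loss is given below. The array layout and the zero-target masking scheme are assumptions; the patent shows its loss only as an image:

```python
import numpy as np

def suppression_loss(pred_heatmaps, target_heatmaps, target_mask):
    """RMSE loss that keeps target joints and suppresses interference
    joints: channels whose joints belong to other instances get an
    all-zero target, driving their responses toward zero.

    pred_heatmaps, target_heatmaps: (K, H, W) joint heat maps
    target_mask: (K,) bool, True where joint k belongs to this instance
    """
    mask = np.asarray(target_mask, dtype=bool)
    target = np.where(mask[:, None, None], target_heatmaps, 0.0)
    return float(np.sqrt(np.mean((pred_heatmaps - target) ** 2)))
```

With an all-zero prediction, an all-ones target, and one of two joints marked as interference, exactly half of the squared error survives the masking.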
TABLE 2
Body part    Joint points
Head         Head joint, neck joint, left eye, right eye
Upper limb   Left and right elbow joints, left and right wrist joints
Lower limb   Left and right ankle joints, left and right knee joints
Trunk        Left and right shoulder joints, left and right hip joints, pelvic joint, spinal joint
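The 18 joints of Table 2 can be held in a simple mapping for downstream use; the English joint identifiers are an assumed rendering of the table's entries:

```python
# 18 joint points grouped by body part, following Table 2.
JOINTS_BY_PART = {
    "head": ["head", "neck", "left_eye", "right_eye"],
    "upper_limb": ["left_elbow", "right_elbow", "left_wrist", "right_wrist"],
    "lower_limb": ["left_ankle", "right_ankle", "left_knee", "right_knee"],
    "trunk": ["left_shoulder", "right_shoulder", "left_hip", "right_hip",
              "pelvis", "spine"],
}
# Flat list, e.g. for indexing the K heat-map channels
ALL_JOINTS = [j for part in JOINTS_BY_PART.values() for j in part]
```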
Then, a human body pose association rule is defined, all generated joint points are associatively connected, and a skeleton labeling graph A = {A_l | 1 ≤ l ≤ L} is generated, where L denotes the number of generated connections. A pose similarity function is defined to eliminate similar poses: f(A_x, A_y, η) = 1[d(A_x, A_y | Λ, λ) ≤ η], where d(·) is a defined pose distance function, Λ, λ are parameter sets, and η is a distance threshold. If d ≤ η, A_x is judged redundant and eliminated. Concretely, d(A_x, A_y | Λ) = Q(A_x, A_y, η1) + μ·H(A_x, A_y, η2), where the function Q computes the confidence of the matched joint points between pose skeletons to obtain the number of matched joints between the poses, counting the joints of A_y that fall within the detection box centered on the corresponding joint of A_x and weighting them by the confidence scores c^n of the joint positions; the function H computes the spatial distance between the joint positions p^n of the n-th joints. Here η1, η2 are tunable parameters of the functions, μ is a weight balancing the two distances, and Λ = {η1, η2, μ}.
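The pose distance d = Q + μ·H can be sketched as below, in the style of parametric pose NMS (as in RMPE). The tanh soft-count for Q and the Gaussian form for H are assumptions; the patent gives its formulas only as images:

```python
import numpy as np

def pose_distance(conf_x, conf_y, pos_x, pos_y, eta1=0.3, eta2=2.0, mu=1.0):
    """d(A_x, A_y | Lambda) = Q + mu * H: Q soft-counts matched joints
    from their confidence scores, H sums a Gaussian of the spatial
    distances between corresponding joint positions.

    conf_*: (N,) joint confidence scores; pos_*: (N, 2) joint positions.
    """
    q = np.sum(np.tanh(np.asarray(conf_x) / eta1) *
               np.tanh(np.asarray(conf_y) / eta1))
    h = np.sum(np.exp(-np.sum((np.asarray(pos_x) - np.asarray(pos_y)) ** 2,
                              axis=1) / eta2))
    return float(q + mu * h)

def is_redundant(conf_x, conf_y, pos_x, pos_y, eta):
    """f(A_x, A_y, eta) = 1[d(...) <= eta]: following the patent's stated
    criterion, A_x is eliminated when the distance falls below eta."""
    return pose_distance(conf_x, conf_y, pos_x, pos_y) <= eta
```

Identical poses score a larger d than widely separated ones, since both the confidence match Q and the spatial term H contribute.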
Finally, the original image coordinates are restored through the inverse affine transformation, and the accurate pose estimation result F_pose is generated through refinement, as shown in fig. 4 (c).
As shown in Tables 1 and 3 and fig. 4, after Step3 the accurate pose estimation result F_pose is obtained. Taking the human body candidate multi-region detection boxes generated by Step1 as input, the second stage of training is performed, and the optimal training parameters of both stages are kept to facilitate fine-tuning in the last training stage. The invention obtains a detection result of higher precision, see Table 1. Meanwhile, based on the detection boxes, the human joint heat maps are obtained, as shown in fig. 4 (b).
For multi-person pose estimation, the invention achieves higher precision. Table 3 compares this example with other models typically used in known methods, such as Mask RCNN, RMPE, HR-Net, and OpenPose. AP (average precision) is the average accuracy used to compute the precision percentage on the test set; OKS (object keypoint similarity) is the keypoint similarity, computed from the scale-normalized Euclidean distance, and is mainly used in multi-person pose estimation tasks. The specific formula is:

OKS_v = Σ_k exp(−(d_v^k)² / (2·S_v²·σ_k²))·δ(v_k > 0) / Σ_k δ(v_k > 0)

where v denotes a person in the GT and v_k the visibility label of that person's k-th keypoint, d_v^k denotes the Euclidean distance from the currently detected keypoint to the keypoint with the same id in the GT, v_k = 1 means the keypoint is unoccluded and labeled, v_k = 2 means the keypoint is occluded but labeled, S_v is the scale factor of the person in the GT, whose value is the square root of the detection-box area, σ_k is the normalization factor of the k-th keypoint, and δ(x) is 1 if x is true and 0 otherwise. T is a manually set threshold; in Table 3, OKS thresholds of 0.50 and 0.75 are used, together with the medium (M) and large (L) person scales.
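The OKS formula above can be computed directly; the array shapes are assumptions, but the formula follows the definition just given:

```python
import numpy as np

def oks(pred, gt, vis, box_area, sigmas):
    """Object Keypoint Similarity: a Gaussian of the per-keypoint
    Euclidean distance, scaled by the person scale S_v (sqrt of the
    detection-box area) and a per-keypoint factor sigma_k, averaged
    over labeled keypoints (v_k > 0).

    pred, gt: (K, 2) keypoint coordinates
    vis: (K,) 0 = unlabeled, 1 = visible, 2 = occluded but labeled
    """
    d2 = np.sum((np.asarray(pred) - np.asarray(gt)) ** 2, axis=1)
    s2 = float(box_area)                    # S_v**2 = (sqrt(area))**2
    labeled = np.asarray(vis) > 0
    if not labeled.any():
        return 0.0
    e = np.exp(-d2 / (2.0 * s2 * np.asarray(sigmas, dtype=float) ** 2))
    return float(e[labeled].mean())
```

A perfect prediction yields OKS = 1; keypoints with v_k = 0 are excluded from the average.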
TABLE 3
Method          AP     AP@OKS=0.50   AP@OKS=0.75   AP^M   AP^L
Mask RCNN       62.9   87.6          67.9          57.5   71.3
RMPE            72.1   88.8          79.1          68.1   77.9
HR-Net          75.5   92.4          83.3          71.9   81.5
OpenPose        61.8   84.9          67.5          57.1   68.2
The invention   76.4   90.1          84.5          72.3   83.1
Step4, first, the accurate pose estimation result F_pose (fig. 4 (c)) and the coarse parsing result F_parsing obtained in Step2 (fig. 5 (b)) are cascaded, and the labels of different human instances are segmented through the human body multi-region candidate boxes and pose constraints.
Then a semantic spatial distance loss L_f_parsing = λ1·L_parsing + L_parsing·L_pose + λ2·L_pose is defined to reduce the gap between different semantics, where λ1 is the parsing loss weight and λ2 is the pose loss weight. L_parsing denotes the coarse parsing loss, a cross-entropy loss over the pixels and label categories, and L_pose denotes the pose estimation loss, measuring the distance between the pixel coordinates of each joint in the image and its real coordinates. Here a denotes the total number of pixel points in the image; M denotes the number of label categories; y_i denotes the category of the i-th pixel point; ln(f_m) denotes the log-probability of predicting the m-th class of semantics; and N is the total number of joints, with the n-th joint contributing its predicted pixel coordinates and real coordinates.
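A minimal numeric sketch of the three loss terms follows. Averaging the cross-entropy over pixels and using a squared Euclidean distance for the pose term are assumptions, since the patent shows both component losses only as images:

```python
import numpy as np

def parsing_ce_loss(probs, labels):
    """Coarse-parsing cross-entropy over a pixels and M classes:
    the negative mean log-probability of each pixel's true class y_i.
    probs: (a, M) predicted class probabilities; labels: (a,) classes."""
    a = len(labels)
    return float(-np.mean(np.log(probs[np.arange(a), labels])))

def pose_mse_loss(pred_joints, gt_joints):
    """Pose loss over the N joints: mean squared distance between the
    predicted pixel coordinates and the real coordinates."""
    diff = np.asarray(pred_joints) - np.asarray(gt_joints)
    return float(np.mean(np.sum(diff ** 2, axis=1)))

def semantic_distance_loss(l_parsing, l_pose, lam1=1.0, lam2=1.0):
    """Step4 joint objective:
    L_f_parsing = lam1*L_parsing + L_parsing*L_pose + lam2*L_pose."""
    return lam1 * l_parsing + l_parsing * l_pose + lam2 * l_pose
```

The multiplicative middle term couples the two tasks, so either loss shrinking also shrinks the coupling term.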
Finally, the cascaded result is mapped by convolution to size N×C, the coarse parsing result is mapped to size C×N, and the fused result is input into 3 convolution layers of size 7×7 with 128 channels for refinement, yielding the final accurate parsing result.
As shown in fig. 6, the final accurate parsing result (c) is obtained after Step4. With the coarse parsing result (b) and the pose estimation result (a) obtained from the first two training stages, the spatial semantic distance loss function is introduced to carry out the third training stage. Table 4 compares this example with other typical instance-level human parsing models in known methods, such as NAN, M-CE2P, and RP-R-CNN, on the same data set; it can be seen that the indexes of the invention are superior to the other known methods. The mIoU computes the ratio of the intersection and union of the sets of ground-truth and predicted values, which can be rewritten as the ratio of TP (the intersection) to the sum of TP, FP, and FN (the union): mIoU = TP/(FP + FN + TP). PCP@0.5 is the percentage of correct parts: a part is considered correctly detected if the distance between the predicted label position and the ground truth is less than half the part length. AP_p is the part-based average precision.
TABLE 4
(Table 4 is rendered as an image in the source; it lists mIoU, PCP@0.5, and AP_p for NAN, M-CE2P, RP-R-CNN, and the invention.)
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (4)

1. An accurate human body parsing method for crowded crowds, characterized in that the method comprises the following steps:
step1, inputting a crowded crowd image set, extracting coarse hierarchical image features and superpixel features through a deep residual network, performing feature representation on the human body images to obtain a foreground accurate semantic feature map, and generating human body candidate multi-region detection boxes;
step2, sampling the foreground accurate semantic feature maps to the same size and fusing them to generate high-resolution features, and obtaining a coarse human body parsing result through preliminary parsing;
step3, performing human body pose estimation on the foreground accurate semantic features within the candidate multi-region detection boxes to generate human body joint points, and refining them to obtain an accurate multi-person pose estimation result;
step4, jointly optimizing the obtained coarse human body parsing result and the accurate multi-person pose estimation result by computing the semantic distance loss, and outputting the final accurate human body parsing result.
2. The accurate human body parsing method for crowded crowds according to claim 1, wherein the specific process of step1 is:
first, hierarchical features P = {P1, P2, P3, P4, P5} are extracted from the input crowded crowd image set G using ResNet101, and a superpixel partition hierarchy S = {S0, S1, ..., SN} is generated using convolution-oriented boundaries (COB), where SN is the superpixel representing the entire image and each Sn is obtained by combining two superpixels of Sn−1; matching the sizes of P2, P4, P5, a subset N = {N2, N4, N5} is selected in S, where the number of nodes between adjacent levels differs by a factor of 1/4;
then, feature mapping is performed on P2, P4, P5 to map them to a graph matrix: X_l = W(Δmin(P_l^e) ‖ Δmax(P_l^e)), where W is the learnable weight matrix of a fully connected layer, ‖ refers to the concatenation operation, Δmin(P_l^e) and Δmax(P_l^e) denote minimum pooling and maximum pooling, respectively, and P_l^e denotes the grid cell of the hierarchical superpixel partition corresponding to the node;
then, the context and hierarchical information of the mapped features are extracted through a graph neural network and fused with the feature pyramid decoding features, with spatial and channel attention added in the graph neural network to reduce redundant computation and obtain the final feature representation result; given a mapping node i and its set of neighboring nodes C_i, the spatial attention of node i is computed with M self-attention heads over the sum of the feature vector sets of the neighboring nodes of node i; the channel attention is obtained by the element-wise multiplication σ(h̄_i) ⊙ h_i, where h̄_i denotes the mean of the feature vectors of node i and its neighbors and σ denotes a fully connected layer with Sigmoid activation; the final attention is the combination of the two attentions scaled by β, where β is the scale weight initialized to 0;
finally, based on the feature representation result, the foreground accurate semantic features F_f = {P′_u | 1 ≤ u ≤ 5} are obtained; the feature representation result is input into the hierarchical cascade RPN to obtain candidate regions, and the human body candidate multi-region detection boxes D_v are generated through classification and regression prediction, where v denotes the number of people in the image.
3. The accurate human body parsing method for crowded crowds according to claim 1, wherein the specific process of step3 is:
first, affine transformation is performed on all the human body candidate multi-region detection boxes D detected in step1;
then, each transformed human body is input to the single-person pose estimation module to generate the joint heat maps H_v = {H_v^k | 1 ≤ k ≤ K}, and two joint types are defined, the target joints belonging to the current human instance and the interference joints belonging to other instances; a loss based on the RMSE between the predicted and target joint heat maps is defined to suppress the interfering joints, where RMSE is the root-mean-square-error function, k denotes the k-th joint of the v-th person, and K is the total number of joints;
then, a human body pose association rule is defined, all generated joint points are associatively connected, and a skeleton labeling graph A = {A_l | 1 ≤ l ≤ L} is generated, where L denotes the number of generated connections; a pose similarity function f(A_x, A_y, η) = 1[d(A_x, A_y | Λ, λ) ≤ η] is defined to eliminate similar poses, where d(·) is a defined pose distance function, Λ, λ are parameter sets, and η is a distance threshold; if d ≤ η, A_x is judged redundant and eliminated; concretely, d(A_x, A_y | Λ) = Q(A_x, A_y, η1) + μ·H(A_x, A_y, η2), where the function Q computes the confidence of the matched joint points between pose skeletons to obtain the number of matched joints between the poses, the function H computes the spatial distance between the joints, η1, η2 are tunable parameters of the functions, μ is a weight balancing the two distances, and Λ = {η1, η2, μ};
finally, the original image coordinates are restored through the inverse affine transformation, and the accurate pose estimation result F_pose is generated through refinement.
4. The accurate human body parsing method for crowded crowds according to claim 1, wherein the specific process of step4 is:
first, the accurate pose estimation result F_pose and the coarse parsing result F_parsing obtained in step2 are cascaded, and the labels of different human instances are segmented through the human body multi-region candidate boxes and pose constraints;
then a semantic spatial distance loss L_f_parsing = λ1·L_parsing + L_parsing·L_pose + λ2·L_pose is defined to reduce the gap between different semantics, where λ1 is the parsing loss weight, λ2 is the pose loss weight, L_parsing denotes the coarse parsing loss, a cross-entropy loss over the a pixel points and M label categories with y_i the category of the i-th pixel point and ln(f_m) the log-probability of predicting the m-th class of semantics, and L_pose denotes the pose estimation loss over the N joint points, measuring the distance between the pixel coordinates of the n-th joint in the image and its real coordinates;
finally, the cascaded result is mapped by convolution to size N×C, the parsing result is mapped to size C×N, and the fused result is input into 3 convolution layers of size 7×7 with 128 channels for refinement, obtaining the final accurate parsing result.
CN202111645897.XA 2021-12-30 2021-12-30 Accurate human body analysis method for crowded people Active CN114973305B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111645897.XA CN114973305B (en) 2021-12-30 2021-12-30 Accurate human body analysis method for crowded people

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111645897.XA CN114973305B (en) 2021-12-30 2021-12-30 Accurate human body analysis method for crowded people

Publications (2)

Publication Number Publication Date
CN114973305A true CN114973305A (en) 2022-08-30
CN114973305B CN114973305B (en) 2023-03-28

Family

ID=82975210

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111645897.XA Active CN114973305B (en) 2021-12-30 2021-12-30 Accurate human body analysis method for crowded people

Country Status (1)

Country Link
CN (1) CN114973305B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115205906A (en) * 2022-09-15 2022-10-18 山东能源数智云科技有限公司 Method, device and medium for detecting warehousing operation personnel based on human body analysis

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102194105A (en) * 2010-03-19 2011-09-21 微软公司 Proxy training data for human body tracking
CN103164858A (en) * 2013-03-20 2013-06-19 浙江大学 Adhered crowd segmenting and tracking methods based on superpixel and graph model
EP3016027A2 (en) * 2014-10-30 2016-05-04 Panasonic Intellectual Property Management Co., Ltd. Human body part detection system and human body part detection method
CN108564012A (en) * 2018-03-29 2018-09-21 北京工业大学 A kind of pedestrian's analytic method based on characteristics of human body's distribution
US20180300540A1 (en) * 2017-04-14 2018-10-18 Koninklijke Philips N.V. Person identification systems and methods
CN111062274A (en) * 2019-12-02 2020-04-24 汇纳科技股份有限公司 Context-aware embedded crowd counting method, system, medium, and electronic device
CN111339903A (en) * 2020-02-21 2020-06-26 河北工业大学 Multi-person human body posture estimation method
US20210082136A1 (en) * 2018-12-04 2021-03-18 Yoti Holding Limited Extracting information from images
CN113111848A (en) * 2021-04-29 2021-07-13 东南大学 Human body image analysis method based on multi-scale features
CN113592893A (en) * 2021-08-29 2021-11-02 浙江工业大学 Image foreground segmentation method combining determined main body and refined edge
CN113673327A (en) * 2021-07-14 2021-11-19 南京邮电大学 Penalty ball hit prediction method based on human body posture estimation
CN113723255A (en) * 2021-08-24 2021-11-30 中国地质大学(武汉) Hyperspectral image classification method and storage medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
JIAN ZHAO, JIANSHU LI: "Understanding Humans in Crowded Scenes: Deep Nested Adversarial Learning and A New Benchmark for Multi-Human Parsing", Proceedings of the 26th ACM International Conference on Multimedia *
JIANSHU LI: "Multi-Human Parsing in the Wild", CSDN *
THOMAS GOLDA: "Human Pose Estimation for Real-World Crowded Scenarios", Tencent Cloud Developer Community *
GAN LIN: "Research on Parsing Methods for Dressed Human Bodies Oriented to 2D Virtual Try-on", Kunming University of Science and Technology *
GAN LIN; LIU LI; LIU LIJUN; FU XIAODONG; HUANG QINGSONG: "An Accurate Human Parsing Model Combining Edge Contours and Pose Features", Journal of Computer-Aided Design & Computer Graphics *

Also Published As

Publication number Publication date
CN114973305B (en) 2023-03-28

Similar Documents

Publication Publication Date Title
CN111339903B (en) Multi-person human body posture estimation method
CN111476181B (en) Human skeleton action recognition method
CN108154118B (en) A kind of target detection system and method based on adaptive combined filter and multistage detection
CN107609460B (en) Human body behavior recognition method integrating space-time dual network flow and attention mechanism
CN111310659B (en) Human body action recognition method based on enhanced graph convolution neural network
WO2020108362A1 (en) Body posture detection method, apparatus and device, and storage medium
Yang et al. Extraction of 2d motion trajectories and its application to hand gesture recognition
Zhang et al. Deep hierarchical guidance and regularization learning for end-to-end depth estimation
Asif et al. A multi-modal, discriminative and spatially invariant CNN for RGB-D object labeling
CN113408455B (en) Action identification method, system and storage medium based on multi-stream information enhanced graph convolution network
CN108960184B (en) Pedestrian re-identification method based on heterogeneous component deep neural network
CN112200111A (en) Global and local feature fused occlusion robust pedestrian re-identification method
CN112307995B (en) Semi-supervised pedestrian re-identification method based on feature decoupling learning
CN112347861B (en) Human body posture estimation method based on motion feature constraint
Sincan et al. Using motion history images with 3d convolutional networks in isolated sign language recognition
CN111310668B (en) Gait recognition method based on skeleton information
Li et al. Effective person re-identification by self-attention model guided feature learning
Zhou et al. Face parsing via a fully-convolutional continuous CRF neural network
CN113052017B (en) Unsupervised pedestrian re-identification method based on multi-granularity feature representation and domain self-adaptive learning
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
Rani et al. An effectual classical dance pose estimation and classification system employing convolution neural network–long shortterm memory (CNN-LSTM) network for video sequences
CN114973305B (en) Accurate human body analysis method for crowded people
CN114187653A (en) Behavior identification method based on multi-stream fusion graph convolution network
Liu et al. Bayesian inferred self-attentive aggregation for multi-shot person re-identification
Lei et al. Continuous action recognition based on hybrid CNN-LDCRF model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant