CN114973305A - Accurate human body analysis method for crowded people - Google Patents
- Publication number
- CN114973305A (application CN202111645897.XA)
- Authority
- CN
- China
- Prior art keywords
- human body
- accurate
- joint
- image
- result
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000004458 analytical method Methods 0.000 title claims abstract description 74
- 238000001514 detection method Methods 0.000 claims abstract description 33
- 238000010586 diagram Methods 0.000 claims abstract description 7
- 238000005457 optimization Methods 0.000 claims abstract description 6
- 238000007670 refining Methods 0.000 claims abstract description 6
- 230000036544 posture Effects 0.000 claims description 43
- 238000000034 method Methods 0.000 claims description 27
- 238000013507 mapping Methods 0.000 claims description 20
- 238000013528 artificial neural network Methods 0.000 claims description 9
- 230000009466 transformation Effects 0.000 claims description 9
- 238000011176 pooling Methods 0.000 claims description 8
- 239000011159 matrix material Substances 0.000 claims description 7
- 239000013598 vector Substances 0.000 claims description 6
- 238000004364 calculation method Methods 0.000 claims description 5
- 230000004927 fusion Effects 0.000 claims description 5
- 238000005192 partition Methods 0.000 claims description 5
- 230000008569 process Effects 0.000 claims description 5
- 230000008030 elimination Effects 0.000 claims description 4
- 238000003379 elimination reaction Methods 0.000 claims description 4
- 230000002452 interceptive effect Effects 0.000 claims description 4
- 238000005070 sampling Methods 0.000 claims description 4
- 230000004580 weight loss Effects 0.000 claims description 4
- 230000004913 activation Effects 0.000 claims description 3
- 238000002372 labelling Methods 0.000 claims description 3
- 230000006870 function Effects 0.000 description 21
- 210000001503 joint Anatomy 0.000 description 18
- 238000012549 training Methods 0.000 description 8
- 230000003993 interaction Effects 0.000 description 4
- 238000012360 testing method Methods 0.000 description 4
- 230000008859 change Effects 0.000 description 2
- 230000000694 effects Effects 0.000 description 2
- 238000000605 extraction Methods 0.000 description 2
- 238000007667 floating Methods 0.000 description 2
- 230000011218 segmentation Effects 0.000 description 2
- 239000013589 supplement Substances 0.000 description 2
- 230000004931 aggregating effect Effects 0.000 description 1
- 210000000544 articulatio talocruralis Anatomy 0.000 description 1
- 230000006399 behavior Effects 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000004422 calculation algorithm Methods 0.000 description 1
- 238000012512 characterization method Methods 0.000 description 1
- 238000013527 convolutional neural network Methods 0.000 description 1
- 238000013136 deep learning model Methods 0.000 description 1
- 238000003708 edge detection Methods 0.000 description 1
- 210000002310 elbow joint Anatomy 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 239000000284 extract Substances 0.000 description 1
- 210000004394 hip joint Anatomy 0.000 description 1
- 210000000629 knee joint Anatomy 0.000 description 1
- 230000004807 localization Effects 0.000 description 1
- 210000003141 lower extremity Anatomy 0.000 description 1
- 230000001537 neural effect Effects 0.000 description 1
- 238000010606 normalization Methods 0.000 description 1
- 238000003909 pattern recognition Methods 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 238000009877 rendering Methods 0.000 description 1
- 210000000323 shoulder joint Anatomy 0.000 description 1
- 210000001364 upper extremity Anatomy 0.000 description 1
- 238000012795 verification Methods 0.000 description 1
- 230000000007 visual effect Effects 0.000 description 1
- 230000003313 weakening effect Effects 0.000 description 1
- 210000003857 wrist joint Anatomy 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to an accurate human body analysis method for crowded crowds, belonging to the field of computer vision and image applications. First, a crowded crowd image set is input, coarse image hierarchical features and superpixel features are extracted through a deep residual network, and feature representation is performed on the human body images to obtain foreground accurate semantic feature maps and generate human body candidate multi-region detection frames. Second, the foreground accurate semantic feature maps are sampled to the same size and fused to generate high-resolution features, and a coarse human body analysis result is obtained through preliminary analysis. Then, human body pose estimation is performed on the foreground accurate semantic features within the candidate multi-region detection frames to generate human body joint points, which are refined into a multi-person accurate pose estimation result. Finally, the coarse human body analysis result and the multi-person accurate pose estimation result are jointly optimized by computing a semantic distance loss, and the final accurate human body analysis result is output. The invention can effectively analyze human body images in crowded crowds.
Description
Technical Field
The invention relates to an accurate human body analysis method for crowds, and belongs to the field of computer vision and image application.
Background
Human body analysis is a fine-grained semantic segmentation task that aims to identify the components of a human body image at the pixel level, such as body parts and clothes. It is a basic task in multimedia and computer vision and has good potential for problems in various visual scenes, such as behavior analysis, video image understanding, and intelligent security. Known methods account for semantic features of different sizes, using structures such as FCN, DeepLabV1, DeepLabV3, SegNet, and ASPP to improve human body analysis by extracting multi-scale semantic features. However, considering only multi-scale information does not capture deep relationships between pixels, and the complex interactions between human instances in crowded scenes cannot be modeled well. Technically, several key problems of human body analysis for crowds remain unsolved, mainly in three respects: 1) the background is complex, and background colors can be too similar to people's clothes; 2) the number of human instances varies greatly and their motion postures are diverse; people in complex motion environments interact strongly, making feature attribution difficult; 3) crowded environments contain complex occlusions, including self-occlusion, occlusion between people and objects, and mutual occlusion between human instances, and these occlusions strongly affect the accuracy of human body analysis. These three respects are the key problems that human body analysis in crowds urgently needs to solve.
Known human body analysis methods are mainly based on feature enhancement, multi-task learning, and the like. For example, Zhang X (Neurocomputing 402, 2020, 375-383) proposes a Semantic Spatial Fusion Network (SSFNet) for human body parsing that narrows the semantic gap and gives accurate high-resolution predictions by aggregating multi-resolution features. Zhang Z (IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, 8897-). However, although these known methods supplement human body analysis with multi-scale semantics and other tasks, achieve good results on single-person parsing, and can be extended to multi-person situations by combining a target detection algorithm, they depend too much on the accuracy of the target detection method, do not consider relationships between different human instances, and struggle to produce good results on crowds. Patent CN113111848A adds a dilated convolution to each feature layer between the encoder and decoder to perform multi-scale feature fusion and enhance the model's feature extraction capability, addressing the insufficient pixel precision of traditional human body analysis methods at human body edges. That method, however, is merely a simply stacked dilated-convolution structure with a large amount of redundant computation; solving only the edge-precision problem is not well suited to the human body analysis task, and it only adds a dilated convolution layer on the last feature layer. Adding superpixel features that characterize the internal image structure in the middle of the codec, by contrast, makes it possible to obtain a preliminary human body structure while obtaining precise edges. Patent CN113537072A performs the pose estimation and human body analysis tasks separately in a joint-learning manner, sharing multi-scale features extracted from the backbone after non-local processing.
Although it accounts for the commonalities between the pose and parsing tasks, it ignores the disparity between the two tasks, and the method is applicable only to the single-person parsing case.
Disclosure of Invention
The invention provides an accurate human body analysis method for crowded people, which is used for effectively analyzing images of the crowded people to obtain an accurate human body analysis result, thereby meeting the current accuracy requirement on the crowded people analysis.
The technical scheme of the invention is as follows: a method for accurate human body analysis of crowds. First, a crowded crowd image set is input, coarse image hierarchical features and superpixel features are extracted through a deep residual network, and feature representation is performed on the human body images to obtain foreground accurate semantic feature maps and generate human body candidate multi-region detection frames. Second, the foreground accurate semantic feature maps are sampled to the same size and fused to generate high-resolution features, and a coarse human body analysis result is obtained through preliminary analysis. Then, human body pose estimation is performed on the foreground accurate semantic features within the candidate multi-region detection frames to generate human body joint points, which are refined into a multi-person accurate pose estimation result. Finally, the coarse human body analysis result and the multi-person accurate pose estimation result are jointly optimized by computing a semantic distance loss, and the final accurate human body analysis result is output.
The method comprises the following specific steps:
Step1: input the crowded crowd image set G = {G_1, G_2, ..., G_n}, extract coarse image hierarchical features and superpixel features through the deep residual network, perform feature representation on the human body images to obtain foreground accurate semantic feature maps, and generate human body candidate multi-region detection frames;
Step2: sample the foreground accurate semantic feature maps of different scales to the same size by bilinear interpolation, fuse them to generate high-resolution features, and obtain a coarse human body analysis result through preliminary analysis.
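The sample-to-common-size-and-fuse operation of Step2 can be sketched with plain numpy (a minimal illustration; function names, output size, and channel counts are assumptions, not from the patent):

```python
import numpy as np

def bilinear_resize(feat, out_h, out_w):
    """Bilinearly resample a (H, W, C) feature map to (out_h, out_w, C)."""
    h, w, c = feat.shape
    # Sample positions in the source map (align_corners-style grid).
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]   # vertical interpolation weights
    wx = (xs - x0)[None, :, None]   # horizontal interpolation weights
    top = feat[y0][:, x0] * (1 - wx) + feat[y0][:, x1] * wx
    bot = feat[y1][:, x0] * (1 - wx) + feat[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def fuse_multiscale(features, out_hw=(64, 64)):
    """Upsample every scale to one size, then fuse by channel concatenation."""
    resized = [bilinear_resize(f, *out_hw) for f in features]
    return np.concatenate(resized, axis=-1)
```

A deep model would follow the concatenation with learned convolutions; here the fusion is kept to the resampling step the text describes.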
Step3, carrying out human body posture estimation on the foreground accurate semantic features in the candidate multi-region detection frame, defining a joint loss function to inhibit interference joints, generating human body joint points, defining a human body posture association rule, carrying out association connection on all the generated joint points, and refining to obtain a multi-person accurate posture estimation result;
and Step4, performing joint optimization on the obtained human body coarse analysis result and the multi-person accurate posture estimation result by calculating the semantic distance loss, and outputting a final accurate human body analysis result.
Step1 proceeds concretely as follows:
First, hierarchical features P = {P_1, P_2, P_3, P_4, P_5} are extracted from the input crowd image set G using ResNet-101, and a series of superpixel partitions S = {S_0, S_1, ..., S_N} is generated using convolution-oriented boundaries (COB), where S_N is the superpixel representing the entire image and each S_k is a merge of the superpixels in S_{k-1}. Matching the sizes of P_2, P_4, and P_5, a subset N = {N_2, N_4, N_5} is selected from S, in which the number of nodes between adjacent levels differs by a factor of 1/4.
Then, feature mapping is performed on P_2, P_4, and P_5 to map them onto a graph matrix, where W is the learnable weight matrix of a fully connected layer, || denotes the concatenation operation, Δ_min(P_l^e) and Δ_max(P_l^e) denote minimum pooling and maximum pooling respectively, and P_l^e denotes the grid cells that jointly correspond to a hierarchical superpixel block.
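Assuming the mapping projects the concatenated min/max-pooled cell features through the fully connected weight W (the formula itself is not reproduced in the text, so this form is an assumption), a numpy sketch of such a superpixel-to-node mapping might look like:

```python
import numpy as np

def map_to_graph(feat, labels, w):
    """Map grid features to superpixel graph nodes.

    feat:   (H, W, C) feature map
    labels: (H, W) integer superpixel assignment per grid cell
    w:      (2*C, D) weight of a fully connected layer (hypothetical shape)
    Each node concatenates min- and max-pooled features over its cells,
    then projects through w.
    """
    n_nodes = labels.max() + 1
    nodes = np.zeros((n_nodes, w.shape[1]))
    for s in range(n_nodes):
        cells = feat[labels == s]                              # (n_cells, C)
        pooled = np.concatenate([cells.min(0), cells.max(0)])  # (2C,)
        nodes[s] = pooled @ w
    return nodes
```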
Context and hierarchical information of the mapped features is then extracted by a graph neural network and fused with the feature-pyramid decoding features; spatial and channel attention are added inside the graph neural network to reduce redundant computation, yielding the final feature representation result. Given a mapping node i and its set of neighboring nodes C_i, the spatial attention of node i aggregates the feature vectors collected from the neighbors of i over M self-attention heads. The channel attention gates the features using the mean of the feature vectors of node i and its neighbors, where σ denotes a fully connected layer with Sigmoid activation and ⊙ denotes element-wise multiplication. The final attention combines the two with a scale weight β initialized to 0.
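As one consistent reading of the channel-attention description (the exact formulas are not reproduced in the text, so the gating form below is an assumption), a numpy sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(node_feats, adj, w_fc):
    """Channel attention per graph node: a Sigmoid-activated fully
    connected layer applied to the mean feature of a node and its
    neighbours, multiplied element-wise onto the node feature.

    node_feats: (N, C) node features
    adj:        (N, N) adjacency matrix (nonzero = neighbour)
    w_fc:       (C, C) fully connected weight (hypothetical shape)
    """
    out = np.empty_like(node_feats)
    for i in range(node_feats.shape[0]):
        group = np.concatenate([[i], np.flatnonzero(adj[i])])
        mean = node_feats[group].mean(axis=0)
        gate = sigmoid(mean @ w_fc)        # per-channel gate in (0, 1)
        out[i] = node_feats[i] * gate
    return out
```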
Finally, the foreground accurate semantic features F_f = {P'_u | 1 ≤ u ≤ 5} are obtained based on the feature representation result. The feature representation result is input into the hierarchical cascade RPN to obtain candidate regions, and the human body candidate multi-region detection frames D = {D_v} are generated through classification and regression prediction, where v represents the number of people in the image.
Step3 proceeds concretely as follows:
first, affine transformation is performed on all the detected human body candidate multi-region detection frames in Step 1.
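The per-box affine transformation, and the inverse later used to restore original-image coordinates, can be sketched as follows (the box format and output size are assumptions, not from the patent):

```python
import numpy as np

def box_affine(box, out_hw=(256, 192)):
    """Affine matrix mapping a detection box (x0, y0, x1, y1) onto a
    fixed-size input for the single-pose module, plus its inverse."""
    x0, y0, x1, y1 = box
    oh, ow = out_hw
    sx, sy = ow / (x1 - x0), oh / (y1 - y0)
    m = np.array([[sx, 0.0, -sx * x0],
                  [0.0, sy, -sy * y0],
                  [0.0, 0.0, 1.0]])
    return m, np.linalg.inv(m)

def apply_affine(m, pts):
    """Apply a 3x3 affine matrix to (N, 2) points."""
    homo = np.hstack([pts, np.ones((len(pts), 1))])
    return (homo @ m.T)[:, :2]
```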
Then, each transformed human body is input into a single-pose estimation module to generate joint heat maps, and two joint types are defined: target joints and interference joints. A loss is defined to suppress the interfering joints, where RMSE is the root-mean-square error function, k denotes the k-th joint of the v-th person, and K is the total number of joints.
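A hedged sketch of Gaussian joint heat maps and an RMSE-style loss that drives interference joints toward zero (the patent's exact loss is not reproduced; this is one plausible form):

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma=2.0):
    """2-D Gaussian heat map peaked at joint location (cx, cy)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def joint_loss(pred, target_joints, interference_joints):
    """RMSE between predicted heat maps and targets: target joints are
    regressed to Gaussian peaks, interference joints to all-zero maps,
    which suppresses them.

    pred:                (K, H, W) predicted heat maps
    target_joints:       {joint index: (cx, cy)} ground-truth locations
    interference_joints: indices whose target is the zero map
    """
    target = np.zeros_like(pred)
    for k, (cx, cy) in target_joints.items():
        target[k] = gaussian_heatmap(*pred.shape[1:], cx, cy)
    for k in interference_joints:
        target[k] = 0.0
    return float(np.sqrt(np.mean((pred - target) ** 2)))
```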
Next, a human body pose association rule is defined and all generated joint points are associatively connected to produce a skeleton annotation graph A = {A_L}, where L denotes the number of generated connections. A pose similarity function f(A_x, A_y, η) = 1[d(A_x, A_y | Λ, λ) ≤ η] is defined to eliminate similar poses, where d(·) is the defined pose distance function, Λ and λ are parameter sets, and η is a distance threshold. If d ≤ η, A_x is judged redundant and eliminated. Concretely, d(A_x, A_y | Λ) = Q(A_x, A_y | η_1) + μH(A_x, A_y | η_2): the function Q computes the confidence of matched joint points between pose skeletons to obtain the number of matched joints, the function H computes the spatial distance between joints, η_1 and η_2 are tunable parameters, μ is a weight balancing the two distances, and Λ = {η_1, η_2, μ}.
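The elimination rule d = Q + μH with threshold η can be sketched as a greedy pose de-duplication pass; the concrete definitions of Q and H below are stand-ins, since the text does not spell them out:

```python
import numpy as np

def pose_distance(ax, ay, conf_x, conf_y, eta1=0.1, eta2=5.0, mu=0.5):
    """d(Ax, Ay | Lambda) = Q + mu * H with stand-in components:
    Q: one minus the fraction of joints whose confidences both pass eta1
       (0 when every joint pair is confidently matched);
    H: mean Euclidean distance between corresponding joints, scaled by eta2.
    ax, ay: (K, 2) joint coordinates; conf_x, conf_y: (K,) confidences."""
    matched = (conf_x > eta1) & (conf_y > eta1)
    q = 1.0 - matched.mean()
    h = np.linalg.norm(ax - ay, axis=1).mean() / eta2
    return q + mu * h

def eliminate_redundant(poses, confs, eta=0.3):
    """Greedy elimination: pose i is kept only if its distance to every
    already-kept pose exceeds eta (d <= eta means redundant)."""
    kept = []
    for i, (p, c) in enumerate(zip(poses, confs)):
        if all(pose_distance(p, poses[j], c, confs[j]) > eta for j in kept):
            kept.append(i)
    return kept
```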
Finally, the original-image coordinates are restored through inverse affine transformation, and refinement produces the accurate pose estimation result F_pose.
The specific process of Step4 is as follows:
First, the accurate pose estimation result F_pose is concatenated with the coarse analysis result F_parsing obtained in Step2, and the labels of different human body instances are segmented through the human body multi-region candidate frames and pose constraints.
Then a semantic spatial distance loss L_f_parsing = λ_1·L_parsing + L_parsing·L_pose + λ_2·L_pose is defined to reduce the gap between different semantics, where λ_1 is the parsing loss weight, λ_2 is the pose loss weight, L_parsing denotes the coarse parsing loss, and L_pose denotes the pose estimation loss. In these losses, a denotes the total number of pixels in the image; m the number of label categories; 1 the indicator in the cross-entropy loss; y_i the category of the i-th pixel; ln(f_m) the log-probability of the prediction being the m-th semantic class; n the total number of joint points; and the remaining symbols the pixel coordinates and ground-truth coordinates of the n-th joint in the image.
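The combined objective λ1·L_parsing + L_parsing·L_pose + λ2·L_pose can be computed as follows; the two component losses are simplified stand-ins for the cross-entropy and joint-coordinate terms described in the text:

```python
import numpy as np

def parsing_loss(probs, labels):
    """Pixel-wise cross-entropy: -mean log p(true class).
    probs: (a, m) predicted class probabilities; labels: (a,) classes."""
    a = labels.size
    return float(-np.log(probs[np.arange(a), labels]).mean())

def pose_loss(pred_xy, true_xy):
    """Mean squared joint-coordinate error over (n, 2) coordinates."""
    return float(np.mean((pred_xy - true_xy) ** 2))

def semantic_distance_loss(probs, labels, pred_xy, true_xy,
                           lam1=1.0, lam2=0.5):
    """L = lam1 * L_parsing + L_parsing * L_pose + lam2 * L_pose,
    the joint objective given in the text."""
    lp = parsing_loss(probs, labels)
    lq = pose_loss(pred_xy, true_xy)
    return lam1 * lp + lp * lq + lam2 * lq
```

The product term couples the two tasks: a pose error is penalized more when the parsing is also poor, which matches the text's intent of joint optimization.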
Finally, the concatenated result is mapped by convolution to size N×C, the parsing result is mapped to size C×N, the two are fused and input into three 7×7 convolutional layers of dimension 128 for refinement, giving the final accurate parsing result.
The invention has the beneficial effects that:
1. Known methods use only neural networks with fixed topology for spatial and scale feature interaction in multi-scale feature learning, ignoring the internal structure of the image, so basic feature information is lost or weakened during propagation and interaction. The present method maps superpixels onto graph nodes to inherit the image's intrinsic hierarchical structure and fuses different levels with intermediate subsets of feature extraction; this not only yields enhanced multi-scale features but also better integrates the intrinsic structure of the image, providing finer-grained feature representation for subsequent target detection and segmentation tasks and improving the accuracy of subsequent human body analysis.
2. Images of crowded scenes suffer from complex backgrounds and postures, large variation in the number of human instances, mutual occlusion, and related problems. Most known human body analysis methods only consider multi-scale pixel precision to improve analysis precision, so models degrade severely in crowded scenes; complex postures and occluded environments are not considered, making high accuracy difficult to obtain. The invention provides top-down multi-person pose estimation, segments different human bodies through accurate target detection, models occlusion and complex postures, combines the bottom-up human body semantic information obtained from human body analysis, and designs a semantic spatial distance loss, so crowded crowd images are analyzed accurately with higher precision.
3. The invention extracts common features through an improved ResNet network, performs human body analysis and pose estimation separately, and maps the pose features into the human body analysis as a supplement, effectively addressing the problems in crowded human body analysis. Superpixel features are added to the feature representation, pixel relationships are established with graph convolution to obtain accurate foreground features, and crowd pose estimation and coarse human body analysis are performed from the same foreground features. The accurate joint-point features of crowded postures and the human body part features in the analysis complement each other, and a joint optimization module is designed to obtain an accurate human body analysis result with high precision.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a detailed flow chart of the features of the present invention.
FIG. 3 is a flowchart illustrating a rough analysis method according to the present invention.
FIG. 4 is a diagram of an example of an accurate pose estimation according to the present invention.
FIG. 5 is a diagram of an example of an accurate analysis result according to the present invention.
FIG. 6 is a flowchart illustrating an embodiment of the present invention.
Detailed Description
Example 1: as shown in fig. 1 to 6, a method for accurate human body analysis for crowded people includes the following steps:
step1, inputting a crowded crowd image set, extracting a coarse image layering characteristic and a super-pixel characteristic through a depth residual error network, performing characteristic representation on a human body image to obtain a foreground accurate semantic characteristic diagram, and generating a human body candidate multi-region detection frame;
step2, sampling the foreground accurate semantic feature map to the same size, fusing the foreground accurate semantic feature map and the foreground accurate semantic feature map to generate high-resolution features, and obtaining a human body coarse analysis result through preliminary analysis;
step3, carrying out human body posture estimation on the foreground accurate semantic features in the candidate multi-region detection frame to generate human body joint points, and refining to obtain a multi-person accurate posture estimation result;
and Step4, performing joint optimization on the obtained human body coarse analysis result and the multi-person accurate posture estimation result by calculating the semantic distance loss, and outputting a final accurate human body analysis result.
Example 2: a precise human body analysis method aiming at crowds comprises the following specific steps:
Step1: as shown in FIG. 2, image (a) is the input original image. Hierarchical features are extracted from the input crowded crowd image (a) using ResNet-101 (see (b)), and a series of superpixel partitions (e) S = {S_0, S_1, ..., S_N} is generated using convolution-oriented boundaries (COB), where S_N is the superpixel representing the entire image and each S_k is a merge of the superpixels in S_{k-1}. Matching the sizes of P_2, P_4, and P_5, a subset N = {N_2, N_4, N_5} is selected from S, with the number of nodes between adjacent levels set to a 1/4 ratio.
Then feature mapping is performed on P_2, P_4, and P_5: specifically, P_2, P_4, and P_5 are first mapped onto the l-th rectangular grid, and the grid cells are then assigned to the superpixel grid-cell set P_l^e, where each grid cell corresponds to a small rectangular area of the input image (a). In the mapping formula, W is the learnable weight matrix of a fully connected layer, || denotes the concatenation operation, and Δ_min(P_l^e) and Δ_max(P_l^e) denote minimum pooling and maximum pooling, respectively.
Context and hierarchy information of the mapped features is then extracted by a graph neural network (see (d)). The three-layer graph neural network extracts context information, hierarchy information, and context information respectively; each layer has its own learning parameters, not shared with the other layers. To reduce redundant computation, spatial and channel attention are added in the graph neural network, and the combined feature-pyramid decoding features are fused to obtain the final feature representation result (see (c)). Given a mapping node i and its set of neighboring nodes C_i, the spatial attention of node i aggregates the feature vectors collected from the neighbors of i over M self-attention heads; the channel attention gates the features using the mean of the feature vectors of the node and its neighbors, where σ denotes a fully connected layer with Sigmoid activation and ⊙ denotes element-wise multiplication. The final attention combines the two with a scale weight β initialized to 0.
Finally, based on the feature representation result (c), the foreground accurate semantic features F_f = {P'_u | 1 ≤ u ≤ 5} are obtained; the labeled example is shown in fig. (g). The feature representation result is input into a hierarchical cascade RPN to obtain candidate regions, and human body candidate multi-region detection boxes D_v are generated through classification and regression prediction, where v denotes the number of people in the image, see fig. (f).
The specific flowchart of Step1 is shown in fig. 2. After Step1, the foreground accurate semantic features (g) and the human body candidate multi-region detection boxes (f) are obtained. The data set is selected from general human parsing data sets such as CIHP and Multi-Human Parsing v2.0, which mainly consist of multi-person images; each image contains more than 3 persons on average, and there are about 60,000 images in total, divided into a training set of 43,683 images, a test set of 10,000 images, and a validation set of 10,000 images. In this example, PyTorch is used to run the experiment with crowded-crowd images as input. The model is trained in the first stage, and the training parameters are continuously adjusted so that the model obtains better foreground semantic features and detection-box precision. The first-stage quantitative comparison is shown in Table 1, where this example is compared with other typical deep-learning target-detection models among known methods, such as Fast RCNN, Fast RCNN+FPN and YOLOV3. Params is the number of learnable parameters of the model, GFLOPS is the number of floating-point operations, Test Speed is the detection speed, and AP bbox@0.5IOU is the detection-box average precision. The results show that although the computational cost is comparatively large, higher accuracy is obtained.
TABLE 1
Method | Params | GFLOPS | Test Speed/ms | AP bbox @0.5IOU |
fast RCNN | 34.6M | 172.3 | 13.9 | 65.6% |
fast RCNN+FPN | 64.1M | 240.6 | 5.1 | 68.3% |
YOLOV3 | 239M | - | 8.12 | 71.7% |
The invention | 113M | 387.6 | 25.5 | 73.1% |
Step2, as shown in FIG. 3, the Step1 foreground accurate semantic features (a) are used as input to generate multi-scale features: P'_2 and P'_4 are up-sampled to the scale of P'_5 by bilinear interpolation, and 1×1 convolutions are applied separately for feature alignment to obtain P''_2 and P''_4, aligning them in the same semantic space; the ASPP structure shown in (c) is used to extract the multi-scale feature P''_5 from P'_5. P''_2, P''_4 and P''_5 are then concatenated and passed through a 1×1 convolution layer θ to predict all human semantic regions, yielding the coarse parsing result F_parsing = θ(P''_2 ‖ P''_4 ‖ P''_5), see fig. 3(b).
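A minimal sketch of the Step2 multi-scale alignment follows; nearest-neighbour upsampling stands in for the bilinear interpolation of the description, and all tensor sizes and weights are illustrative:

```python
import numpy as np

def upsample(x, factor):
    """Nearest-neighbour upsampling as a stand-in for bilinear interpolation;
    x has shape (C, H, W) and each pixel is tiled factor x factor times."""
    return np.kron(x, np.ones((1, factor, factor)))

def conv1x1(x, W):
    """A 1x1 convolution is a per-pixel linear map: (C_in, H, W) -> (C_out, H, W)."""
    c, h, w = x.shape
    return (W @ x.reshape(c, -1)).reshape(W.shape[0], h, w)

rng = np.random.default_rng(2)
p2 = rng.normal(size=(8, 4, 4))    # pyramid levels at illustrative sizes
p4 = rng.normal(size=(8, 8, 8))
p5 = rng.normal(size=(8, 16, 16))  # target scale
W = rng.normal(size=(8, 8))        # 1x1 conv weights aligning the semantic space
aligned = [conv1x1(upsample(p2, 4), W), conv1x1(upsample(p4, 2), W), p5]
fused = np.concatenate(aligned, axis=0)  # channel-wise fusion before the 1x1 prediction layer
print(fused.shape)  # (24, 16, 16)
```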
Step3, affine transformation is first performed on all the human body candidate multi-region detection boxes in D obtained in Step1.
Wherein β1, β2 and β3 are parameter vectors, and the transformation maps the coordinates before transformation to the coordinates after transformation.
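The box warp can be sketched in homogeneous coordinates, with the rows of `beta` playing the role of the parameter vectors β1, β2, β3 (an illustrative parameterisation, not the patent's exact one):

```python
import numpy as np

def affine_transform(points, beta):
    """Apply a 2x3 affine transform to corner points:
    [x', y']^T = beta @ [x, y, 1]^T."""
    pts = np.hstack([points, np.ones((len(points), 1))])  # homogeneous coordinates
    return pts @ beta.T

beta = np.array([[2.0, 0.0, 1.0],    # scale x by 2, shift x by +1
                 [0.0, 2.0, -1.0]])  # scale y by 2, shift y by -1
corners = np.array([[0.0, 0.0], [4.0, 3.0]])  # top-left and bottom-right of a box
warped = affine_transform(corners, beta)
print(warped)  # [[ 1. -1.] [ 9.  5.]]
```

The inverse transform (used at the end of Step3 to restore original image coordinates) is the same operation with the inverted parameter matrix.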
Then, as shown in Table 2, 18 joint points are defined over the body parts of the human body, and each transformed human body is input into the single-pose estimation module, generating joint heat maps as shown in fig. 4(b). Two joint types are then defined, namely target joints and interference joints. A loss function based on RMSE (the root-mean-square error function) is defined to suppress the interference joints, where k denotes the k-th joint of the v-th person and K is the total number of joints. During training, joints that do not belong to the human instance in each detection box are suppressed using the defined loss function, reducing mis-connections.
TABLE 2
Body part | Joint points |
Head | Head joint, neck joint, left-eye joint, right-eye joint |
Upper limbs | Left and right elbow joints, left and right wrist joints |
Lower limbs | Left and right ankle joints, left and right knee joints |
Trunk | Left and right shoulder joints, left and right hip joints, pelvic joint, spinal joint |
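A sketch of the interference-joint suppression described in Step3 follows; regressing interference joints toward zero heatmaps is one plausible reading of the defined RMSE loss, and the equal per-joint weighting is an assumption:

```python
import numpy as np

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

def joint_suppression_loss(heatmaps, target_heatmaps, is_target):
    """For the K joints of one detection box, target joints are regressed to
    their ground-truth heatmaps while interference joints (joints of other
    people falling inside the box) are regressed to zero, suppressing
    spurious responses that would cause mis-connections."""
    loss = 0.0
    for k in range(len(heatmaps)):
        ref = target_heatmaps[k] if is_target[k] else np.zeros_like(heatmaps[k])
        loss += rmse(heatmaps[k], ref)
    return loss / len(heatmaps)

hm = np.ones((2, 4, 4))  # two predicted joint heatmaps
gt = np.ones((2, 4, 4))
# joint 0 belongs to this person; joint 1 is an interference joint
loss = joint_suppression_loss(hm, gt, is_target=[True, False])
print(loss)  # 0.5: the target joint fits perfectly, the interference joint is penalised
```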
Then a human body pose association rule is defined, all generated joint points are associatively connected, and a skeleton labeling set A = {A_L} is generated, where L denotes the number of generated connections. A pose similarity function is defined to eliminate similar poses: f(A_x, A_y, η) = 1[d(A_x, A_y | Λ, λ) ≤ η], where d(·) is a defined pose distance function, Λ, λ are parameter sets, and η is a distance threshold. If d ≤ η, A_x is judged redundant and eliminated. Specifically, d(A_x, A_y | Λ) = Q(A_x, A_y | η_1) + μH(A_x, A_y | η_2): the function Q computes the confidence of matched joint points between pose skeletons to obtain the number of matched joints between poses; the function H computes the spatial distance between joints; η_1 and η_2 are tunable parameters; μ is a weight balancing the two distances; and Λ = {η_1, η_2, μ}. The specific formula is as follows:
wherein the three quantities in the formula denote, respectively, the center of the detection box, the position of the n-th joint, and the confidence score of the position of the n-th joint.
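The pose-elimination rule built from Q, H and the threshold η can be sketched as follows; the soft forms of Q and H and all threshold values are illustrative assumptions:

```python
import numpy as np

def pose_distance(joints_x, joints_y, conf_y, eta1=0.3, eta2=1.0, mu=0.5):
    """Sketch of d = Q + mu * H: Q accumulates the confidence of joints of
    pose y lying within eta1 of the matching joint of pose x (a soft count of
    matched joints), and H sums normalised spatial proximities between joints."""
    dist = np.linalg.norm(joints_x - joints_y, axis=1)
    q = float(np.sum(conf_y[dist <= eta1]))  # matched-joint confidence term
    h = float(np.sum(np.exp(-dist / eta2)))  # spatial proximity term
    return q + mu * h

def is_redundant(joints_x, joints_y, conf_y, eta=5.0):
    """A pose is eliminated when its similarity to an already kept pose
    crosses the threshold; the sign convention here is illustrative."""
    return pose_distance(joints_x, joints_y, conf_y) >= eta

pose_a = np.zeros((18, 2))  # 18 joints per pose, two identical skeletons
pose_b = np.zeros((18, 2))
conf = np.ones(18)
print(is_redundant(pose_a, pose_b, conf))  # True: a duplicate pose is removed
```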
Finally, the original image coordinates are restored through inverse affine transformation, and the accurate pose estimation result F_pose is generated through refinement, as shown in fig. 4(c).
As shown in Tables 1 and 3 and fig. 4, after Step3 the accurate pose estimation result F_pose is obtained. Taking the human body candidate multi-region detection boxes generated by Step1 as input, the second training stage is performed, and the optimal training parameters of both stages are kept to facilitate fine-tuning during the last training stage. The invention obtains a detection result with higher precision, see Table 1. Meanwhile, based on the detection boxes, human joint heat maps are obtained, as shown in fig. 4(b).
For multi-person pose estimation, the invention has higher precision. Table 3 compares this example with other models typically used in known methods, such as Mask RCNN, RMPE, HR-Net and OpenPose. AP (average precision) is the average accuracy used to compute the precision percentage on the test set; OKS (object keypoint similarity) is the keypoint similarity, computed from a scale-normalized Euclidean distance, and is mainly used in multi-person pose estimation tasks. The specific formula is as follows:
where v denotes a person in the ground truth (GT) and k indexes that person's keypoints; d is the Euclidean distance between the currently detected keypoint and the keypoint with the same id in the GT; the visibility flag equals 1 when the keypoint is unoccluded and labeled, and takes another value when the keypoint is occluded but labeled; S_v is the scale factor of the person in the GT, whose value is the square root of the detection-box area; σ is the normalization factor of the corresponding keypoint; and δ(x) equals 1 if x is true and 0 otherwise. T is a manually set threshold; following Table 3, AP is reported at OKS thresholds 50 and 75, together with the medium-object metric AP^M and the large-object metric AP^L.
TABLE 3
Method | AP | AP oks=50 | AP oks=75 | AP M | AP L |
Mask RCNN | 62.9 | 87.6 | 67.9 | 57.5 | 71.3 |
RMPE | 72.1 | 88.8 | 79.1 | 68.1 | 77.9 |
HR-Net | 75.5 | 92.4 | 83.3 | 71.9 | 81.5 |
OpenPose | 61.8 | 84.9 | 67.5 | 57.1 | 68.2 |
The invention | 76.4 | 90.1 | 84.5 | 72.3 | 83.1 |
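The OKS formula above can be sketched directly; the two-keypoint layout, the visibility flags and the per-keypoint sigmas below are illustrative:

```python
import numpy as np

def oks(pred, gt, vis, area, sigmas):
    """Object Keypoint Similarity: the per-keypoint squared Euclidean distance
    d^2 is mapped to exp(-d^2 / (2 * S^2 * sigma^2)) and averaged over the
    keypoints that are labelled (vis > 0); `area` plays the role of S^2, the
    detection-box area."""
    d2 = np.sum((pred - gt) ** 2, axis=1)
    e = np.exp(-d2 / (2.0 * area * sigmas ** 2))
    mask = vis > 0
    return float(np.sum(e[mask]) / np.sum(mask))

gt = np.array([[0.0, 0.0], [10.0, 10.0]])
pred = gt.copy()               # a perfect prediction
vis = np.array([2, 1])         # 2 = visible, 1 = labelled but occluded
sigmas = np.array([0.05, 0.05])
print(oks(pred, gt, vis, area=100.0, sigmas=sigmas))  # 1.0
```

A detection counts as correct at threshold T when OKS ≥ T, which is how the AP@OKS=50 and AP@OKS=75 columns of Table 3 are computed.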
Step4, first, the accurate pose estimation result F_pose (fig. 4(c)) and the coarse parsing result F_parsing obtained in Step2 (fig. 5(b)) are cascaded, and the labels of different human instances are segmented through the human body multi-region candidate boxes and pose constraints.
Then a semantic spatial distance loss is defined, L_f_parsing = λ_1·L_parsing + L_parsing·L_pose + λ_2·L_pose, to reduce the gap between different semantics, where λ_1 is the parsing loss weight and λ_2 is the pose loss weight; L_parsing denotes the coarse parsing loss and L_pose the pose estimation loss; a denotes the total number of pixels in the image; m denotes the number of label categories; 1 denotes the first-type cross-entropy loss; y_i denotes the category of the i-th pixel; ln(f_m) denotes the log-probability of predicting the m-th class of semantics; n is the total number of joints; and the remaining symbols denote the pixel coordinates and the ground-truth coordinates of the n-th joint in the image, respectively.
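The combined loss is a one-liner; the cross term L_parsing · L_pose is what couples the two branches, so errors in one re-weight the other. The weights λ1, λ2 below are illustrative:

```python
def fused_parsing_loss(l_parsing, l_pose, lam1=1.0, lam2=0.5):
    """Semantic spatial distance loss from Step4:
    L = lam1 * L_parsing + L_parsing * L_pose + lam2 * L_pose."""
    return lam1 * l_parsing + l_parsing * l_pose + lam2 * l_pose

print(round(fused_parsing_loss(0.4, 0.2), 2))  # 0.58 = 0.4 + 0.08 + 0.1
```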
Finally, the cascaded result is mapped through convolution to size N×C and the coarse parsing result to size C×N; the two are fused and input into 3 convolution layers of size 7×7 with 128 channels for refinement, yielding the final accurate parsing result.
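One plausible reading of the N×C / C×N fusion is a matrix product between the two mapped results (a non-local-style pairing of the N spatial positions through the C channels); the sizes below are illustrative and this is a sketch, not the patent's exact operation:

```python
import numpy as np

rng = np.random.default_rng(3)
N, C = 6, 4                        # illustrative sizes
cascade = rng.normal(size=(N, C))  # cascaded pose+parsing result, mapped to N x C
coarse = rng.normal(size=(C, N))   # coarse parsing result, mapped to C x N
fused = cascade @ coarse           # N x N affinity-style fusion of the two mappings
print(fused.shape)  # (6, 6): fed to the 7x7, 128-channel refinement convolutions
```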
As shown in fig. 6, the final accurate parsing result (c) is obtained after Step4. The spatial semantic distance loss function is introduced for the third-stage training, using the coarse parsing result (b) and the pose estimation result (a) obtained from the first two training stages. Table 4 compares this example with other typical instance-level human parsing models among known methods, such as NAN, M-CE2P and RP-R-CNN, on the same data set; it can be seen that the indexes of the invention are superior to the other known methods. Here mIOU is the ratio of the intersection to the union of the ground-truth and predicted sets, which can be written as TP (the intersection) over the union of TP, FP and FN: mIOU = TP/(FP + FN + TP). PCP_50 is the percentage of correct parts: a predicted label position is considered correctly detected if its distance to the ground truth is less than half the part length (usually expressed as PCP@0.5). AP^P is the part-based average precision.
TABLE 4
While the present invention has been described in detail with reference to the embodiments shown in the drawings, it is not limited to these embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the invention.
Claims (4)
1. An accurate human body analysis method for crowded people, characterized in that the method comprises the following steps:
Step1, inputting a crowded-crowd image set, extracting coarse hierarchical image features and superpixel features through a deep residual network, and performing feature representation on the human body images to obtain a foreground accurate semantic feature map and generate human body candidate multi-region detection boxes;
Step2, sampling the foreground accurate semantic feature maps to the same size and fusing them to generate high-resolution features, and obtaining a coarse human body parsing result through preliminary parsing;
Step3, performing human body pose estimation on the foreground accurate semantic features inside the candidate multi-region detection boxes to generate human body joint points, and refining them to obtain an accurate multi-person pose estimation result;
Step4, jointly optimizing the obtained coarse human body parsing result and the accurate multi-person pose estimation result by computing the semantic distance loss, and outputting the final accurate human body parsing result.
2. The accurate human body analysis method for crowded people according to claim 1, wherein the specific process of Step1 is as follows:
first, hierarchical features P = {P_1, P_2, P_3, P_4, P_5} are extracted from the input crowded-crowd image set G using ResNet-101, and a superpixel partition series S = {S_0, S_1, ..., S_N} is generated using COB convolution-oriented boundaries, where S_N is a single superpixel representing the entire image, formed by combining two superpixels of S_{N-1}; to match the sizes of P_2, P_4 and P_5, a subset N = {N_2, N_4, N_5} is selected from S, where the number of nodes between adjacent levels differs by a factor of 1/4;
then, the feature mapping W[Δmin(P_l^e) ‖ Δmax(P_l^e)] is performed on P_2, P_4 and P_5 to map them onto a graph matrix, where W is the learnable weight matrix of a fully connected layer, ‖ denotes the concatenation operation, Δmin(P_l^e) and Δmax(P_l^e) denote minimum pooling and maximum pooling respectively, and P_l^e denotes the grid cells of the corresponding hierarchical superpixel partition;
then, the context and hierarchical information of the mapped features is extracted through a graph neural network and fused with the combined feature-pyramid decoding features; to reduce redundant computation, spatial and channel attention are added to the graph neural network to obtain the final feature representation result; given a mapping node i and its neighboring node set C_i, the spatial attention of node i is computed with M self-attention heads over the sum of the feature-vector sets collected from the neighbor nodes of node i; the channel attention takes the mean of the feature vectors of node i and its neighbors, applies a fully connected layer σ activated by Sigmoid, and performs element-wise multiplication; the final attention is fused through a scale weight β initialized to 0;
finally, the foreground accurate semantic features F_f = {P'_u | 1 ≤ u ≤ 5} are obtained based on the feature representation result; the feature representation result is input into a hierarchical cascade RPN to obtain candidate regions, and human body candidate multi-region detection boxes D_v are generated through classification and regression prediction, where v denotes the number of people in the image.
3. The accurate human body analysis method for crowded people according to claim 1, wherein the specific process of Step3 is as follows:
firstly, affine transformation is performed on all the human body candidate multi-region detection boxes D detected in Step1;
then, each transformed human body is input into the single-pose estimation module to generate joint heat maps, and two joint types are defined, namely target joints and interference joints; a loss based on RMSE, the root-mean-square error function, is defined to suppress the interference joints, where k denotes the k-th joint of the v-th person and K is the total number of joints;
then, a human body pose association rule is defined, all generated joint points are associatively connected, and a skeleton labeling set A = {A_L} is generated, where L denotes the number of generated connections; a pose similarity function f(A_x, A_y, η) = 1[d(A_x, A_y | Λ, λ) ≤ η] is defined to eliminate similar poses, where d(·) is a defined pose distance function, Λ, λ are parameter sets, and η is a distance threshold; if d ≤ η, A_x is judged redundant and eliminated; specifically, d(A_x, A_y | Λ) = Q(A_x, A_y | η_1) + μH(A_x, A_y | η_2), where the function Q computes the confidence of matched joint points between pose skeletons to obtain the number of matched joints between poses, the function H computes the spatial distance between joints, η_1 and η_2 are tunable parameters, μ is a weight balancing the two distances, and Λ = {η_1, η_2, μ};
finally, the original image coordinates are restored through inverse affine transformation, and the accurate pose estimation result F_pose is generated through refinement.
4. The accurate human body analysis method for crowded people according to claim 1, wherein the specific process of Step4 is as follows:
firstly, the accurate pose estimation result F_pose and the coarse parsing result F_parsing obtained in Step2 are cascaded, and the labels of different human instances are segmented through the human body multi-region candidate boxes and pose constraints;
then, a semantic spatial distance loss L_f_parsing = λ_1·L_parsing + L_parsing·L_pose + λ_2·L_pose is defined to reduce the gap between different semantics, where λ_1 is the parsing loss weight, λ_2 is the pose loss weight, L_parsing denotes the coarse parsing loss, L_pose denotes the pose estimation loss, a denotes the total number of pixels in the image, m denotes the number of label categories, 1 denotes the first-type cross-entropy loss, y_i denotes the category of the i-th pixel, ln(f_m) denotes the log-probability of predicting the m-th class of semantics, n is the total number of joint points, and the remaining symbols denote the pixel coordinates and the ground-truth coordinates of the n-th joint in the image, respectively;
finally, the cascaded result is mapped through convolution to size N×C and the parsing result to size C×N; the two are fused and input into 3 convolution layers of size 7×7 with 128 channels for refinement to obtain the final accurate parsing result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111645897.XA CN114973305B (en) | 2021-12-30 | 2021-12-30 | Accurate human body analysis method for crowded people |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114973305A true CN114973305A (en) | 2022-08-30 |
CN114973305B CN114973305B (en) | 2023-03-28 |
Family
ID=82975210
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111645897.XA Active CN114973305B (en) | 2021-12-30 | 2021-12-30 | Accurate human body analysis method for crowded people |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114973305B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115205906A (en) * | 2022-09-15 | 2022-10-18 | 山东能源数智云科技有限公司 | Method, device and medium for detecting warehousing operation personnel based on human body analysis |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102194105A (en) * | 2010-03-19 | 2011-09-21 | 微软公司 | Proxy training data for human body tracking |
CN103164858A (en) * | 2013-03-20 | 2013-06-19 | 浙江大学 | Adhered crowd segmenting and tracking methods based on superpixel and graph model |
EP3016027A2 (en) * | 2014-10-30 | 2016-05-04 | Panasonic Intellectual Property Management Co., Ltd. | Human body part detection system and human body part detection method |
CN108564012A (en) * | 2018-03-29 | 2018-09-21 | 北京工业大学 | A kind of pedestrian's analytic method based on characteristics of human body's distribution |
US20180300540A1 (en) * | 2017-04-14 | 2018-10-18 | Koninklijke Philips N.V. | Person identification systems and methods |
CN111062274A (en) * | 2019-12-02 | 2020-04-24 | 汇纳科技股份有限公司 | Context-aware embedded crowd counting method, system, medium, and electronic device |
CN111339903A (en) * | 2020-02-21 | 2020-06-26 | 河北工业大学 | Multi-person human body posture estimation method |
US20210082136A1 (en) * | 2018-12-04 | 2021-03-18 | Yoti Holding Limited | Extracting information from images |
CN113111848A (en) * | 2021-04-29 | 2021-07-13 | 东南大学 | Human body image analysis method based on multi-scale features |
CN113592893A (en) * | 2021-08-29 | 2021-11-02 | 浙江工业大学 | Image foreground segmentation method combining determined main body and refined edge |
CN113673327A (en) * | 2021-07-14 | 2021-11-19 | 南京邮电大学 | Penalty ball hit prediction method based on human body posture estimation |
CN113723255A (en) * | 2021-08-24 | 2021-11-30 | 中国地质大学(武汉) | Hyperspectral image classification method and storage medium |
Non-Patent Citations (5)
Title |
---|
JIAN ZHAO, JIANSHU LI: "Understanding Humans in Crowded Scenes: Deep Nested Adversarial Learning and A New Benchmark for Multi-Human Parsing", Proceedings of the 26th ACM International Conference on Multimedia *
JIANSHU LI: "Multi-Human Parsing in the Wild", CSDN *
THOMAS GOLDA: "Human Pose Estimation for Real-World Crowded Scenarios", Tencent Cloud Developer Community *
GAN LIN: "Research on a Dressed Human Body Parsing Method for Two-Dimensional Virtual Try-On", Kunming University of Science and Technology *
GAN LIN; LIU LI; LIU LIJUN; FU XIAODONG; HUANG QINGSONG: "An Accurate Human Body Parsing Model Combining Edge Contours and Pose Features", Journal of Computer-Aided Design & Computer Graphics *
Also Published As
Publication number | Publication date |
---|---|
CN114973305B (en) | 2023-03-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111339903B (en) | Multi-person human body posture estimation method | |
CN111476181B (en) | Human skeleton action recognition method | |
CN108154118B (en) | A kind of target detection system and method based on adaptive combined filter and multistage detection | |
CN107609460B (en) | Human body behavior recognition method integrating space-time dual network flow and attention mechanism | |
CN111310659B (en) | Human body action recognition method based on enhanced graph convolution neural network | |
WO2020108362A1 (en) | Body posture detection method, apparatus and device, and storage medium | |
Yang et al. | Extraction of 2d motion trajectories and its application to hand gesture recognition | |
Zhang et al. | Deep hierarchical guidance and regularization learning for end-to-end depth estimation | |
Asif et al. | A multi-modal, discriminative and spatially invariant CNN for RGB-D object labeling | |
CN113408455B (en) | Action identification method, system and storage medium based on multi-stream information enhanced graph convolution network | |
CN108960184B (en) | Pedestrian re-identification method based on heterogeneous component deep neural network | |
CN112200111A (en) | Global and local feature fused occlusion robust pedestrian re-identification method | |
CN112307995B (en) | Semi-supervised pedestrian re-identification method based on feature decoupling learning | |
CN112347861B (en) | Human body posture estimation method based on motion feature constraint | |
Sincan et al. | Using motion history images with 3d convolutional networks in isolated sign language recognition | |
CN111310668B (en) | Gait recognition method based on skeleton information | |
Li et al. | Effective person re-identification by self-attention model guided feature learning | |
Zhou et al. | Face parsing via a fully-convolutional continuous CRF neural network | |
CN113052017B (en) | Unsupervised pedestrian re-identification method based on multi-granularity feature representation and domain self-adaptive learning | |
CN116596966A (en) | Segmentation and tracking method based on attention and feature fusion | |
Rani et al. | An effectual classical dance pose estimation and classification system employing convolution neural network–long shortterm memory (CNN-LSTM) network for video sequences | |
CN114973305B (en) | Accurate human body analysis method for crowded people | |
CN114187653A (en) | Behavior identification method based on multi-stream fusion graph convolution network | |
Liu et al. | Bayesian inferred self-attentive aggregation for multi-shot person re-identification | |
Lei et al. | Continuous action recognition based on hybrid CNN-LDCRF model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |