CN111898566A - Attitude estimation method, attitude estimation device, electronic equipment and storage medium

Publication number: CN111898566A (application CN202010771698.2A; granted as CN111898566B)
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: joint, target, joints, candidate, target joint
Inventors: 高联丽, 代燕, 王轩瀚, 宋井宽
Assignees: Chengdu Jingzhili Technology Co., Ltd.; University of Electronic Science and Technology of China
Application filed 2020-08-04 by Chengdu Jingzhili Technology Co., Ltd. and University of Electronic Science and Technology of China
Published as CN111898566A on 2020-11-06; granted as CN111898566B on 2023-02-03
Legal status: Granted; Active

Classifications

    • G06V40/103: Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G06F18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N3/02: Neural networks
    • G06T7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06V40/23: Recognition of whole body movements, e.g. for sport training


Abstract

The invention discloses a pose estimation method, a pose estimation device, electronic equipment and a storage medium, aiming to improve the accuracy of pose estimation in crowded scenes. The method comprises the following steps: extracting visual features from an area image defined by a pedestrian detection frame; identifying all joints in the area image according to the visual features and establishing a candidate joint set; evaluating all joints in the candidate joint set and obtaining target joint information of the target pedestrian instance in the area image; and generating a target joint estimation result according to the target joint information so as to generate an estimated pose corresponding to the target pedestrian instance. Because all joints in the area image are identified from the extracted visual features and gathered into a candidate joint set that contains both the target joints and the interfering joints, and all joints in the candidate joint set are then evaluated to obtain the target joint information of the target pedestrian instance, the accuracy of pose estimation in crowded scenes is improved.

Description

Attitude estimation method, attitude estimation device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of image processing, and in particular to a pose estimation method, a pose estimation device, electronic equipment and a storage medium.
Background
Human pose estimation is a fundamental and challenging problem in computer vision that aims to accurately localize multiple human bodies and the sparse joint positions on their skeletons from a single RGB image. With the application of deep convolutional neural networks (CNNs) and the release of large-scale data sets such as MSCOCO, pose estimation methods have developed rapidly. They can be roughly divided into bottom-up and top-down methods. Bottom-up methods first detect all human joints and then group them into different human instances; the problem mostly centers on how to group candidate joints into individual human instances. Top-down methods take the opposite approach: all human body instances are located first, and pose estimation is then performed for each pedestrian; these methods mainly focus on how to design a more efficient single-person pose estimator (SPPE). Compared with bottom-up methods, which need not detect human body instances, top-down methods generally achieve better pose estimation performance but lower inference speed.
Although existing top-down pose estimation methods perform well in simple scenes, they still face great challenges in crowded scenes. By "crowded scene" is meant an RGB image capturing a complex real-world scene with highly overlapping pedestrians, severe occlusions, diverse poses, and multi-scale variations. For crowded scenes, existing top-down pose estimation methods suffer from the following two technical problems:
1) The pedestrian detection frame contains joints of multiple people. Current top-down methods assume that each detected pedestrian instance contains only the joints belonging to the target pedestrian, i.e., the target joints. However, crowded scenes typically contain highly occluded or overlapping pedestrians, which means that a generated pedestrian detection frame contains, in addition to the target joints, joints belonging to other pedestrian instances, i.e., interfering joints. Under the above assumption, a conventional top-down method may assign different pedestrian labels to the joints of the same person, and once an interfering joint is taken to be a target joint the error is irreversible. In addition, these interfering joints are very likely to be the target joints of other pedestrians, so they cannot be suppressed excessively while the target joint response is enhanced. Since interfering joints can greatly confuse the prediction of the target joints, eliminating them from a given pedestrian detection frame is a very challenging technical problem.
2) Blurred joints in crowded scenes. Traditional top-down pose estimation methods depend heavily on the extraction of visual pose features, yet the visual features extracted from the area image capture only visual appearance and lack prior knowledge of the human body structure. A pose estimator may therefore fail when faced with blurred joints caused by crowded scenes, such as joints that are invisible due to severe occlusion, or joints with highly similar visual appearance. Humans, however, can estimate such blurred joints well by looking at the surrounding area. For example, relying on common sense reasoning, one can easily infer the location of the "neck" after seeing the "head" and "shoulders". Therefore, another key technical problem is how to embed the modeling capability of common sense knowledge into current pose estimation methods.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: provide a pose estimation method, a pose estimation device, an electronic device and a storage medium that improve the accuracy of pose estimation in crowded scenes.
The technical solution adopted by the invention to solve the technical problem is as follows: a pose estimation method comprising: extracting visual features from an area image defined by a pedestrian detection frame; identifying all joints in the area image according to the visual features and establishing a candidate joint set; evaluating all joints in the candidate joint set and obtaining target joint information of the target pedestrian instance in the area image; and generating a target joint estimation result according to the target joint information so as to generate an estimated pose corresponding to the target pedestrian instance.
In the above pose estimation method, all joints in the area image are identified from the extracted visual features and a candidate joint set is established; at this point the candidate joint set contains both the target joints and the interfering joints. All joints in the candidate joint set are then evaluated to obtain the target joint information of the target pedestrian instance in the area image. This avoids the problem that a traditional top-down method may assign different pedestrian labels to the joints of the same person, and thereby improves the accuracy of pose estimation in crowded scenes.
According to embodiments of the invention provided in this specification, the process of generating the target joint estimation result from the target joint information is implemented by a target joint estimator modeled with common human knowledge. By introducing a target joint estimator modeled with common human knowledge, target joint estimation can draw on the reasoning capability of human common sense, further improving the accuracy of pose estimation in crowded scenes.
According to an embodiment of the invention provided in this specification, the target joint information includes a corrected feature obtained by correcting the visual features through an attention mechanism whose object of attention is the target joints obtained by excluding the interfering joints from the candidate joint set after the evaluation.
According to an embodiment of the invention provided in this specification, the process of evaluating all joints in the candidate joint set and obtaining the target joint information of the target pedestrian instance in the area image includes modeling the relationships of all joints in the candidate joint set and, on that basis, removing the interfering joints to obtain the target joint information of the target pedestrian instance.
According to one aspect of the invention provided in this specification, there is provided an apparatus for pose estimation. The apparatus is configured as an artificial neural network, comprising: a visual feature extraction module for extracting visual features from the area image defined by the pedestrian detection frame; a candidate joint identification module for identifying all joints in the area image according to the visual features and establishing a candidate joint set; a target joint information generation module for evaluating all joints in the candidate joint set and obtaining target joint information of the target pedestrian instance in the area image; and an estimated pose generation module for generating a target joint estimation result according to the target joint information so as to generate an estimated pose corresponding to the target pedestrian instance.
With the above apparatus for pose estimation, all joints in the area image are identified from the extracted visual features and a candidate joint set is established; at this point the candidate joint set contains both the target joints and the interfering joints. All joints in the candidate joint set are then evaluated to obtain the target joint information of the target pedestrian instance in the area image. This avoids the problem that a traditional top-down method may assign different pedestrian labels to the joints of the same person, and thereby improves the accuracy of pose estimation in crowded scenes.
According to an embodiment of the invention provided in this specification, the estimated pose generation module comprises a target joint estimator modeled with human common sense. Likewise, by introducing a target joint estimator modeled with common human knowledge, the reasoning capability of human common sense can help complete target joint estimation, further improving the accuracy of pose estimation in crowded scenes.
According to an embodiment of the invention provided in this specification, the target joint information generation module includes a visual feature correction means that obtains a corrected feature by correcting the visual features through an attention mechanism whose object of attention is the target joints obtained by excluding the interfering joints from the candidate joint set after the evaluation.
According to an embodiment of the invention provided in this specification, the candidate joint identification module comprises a multi-joint heatmap generation means for generating a heatmap of all joints from the extracted visual features; the estimated pose generation module comprises a target joint heatmap generation means for generating a target joint heatmap from the target joint estimation result; and the artificial neural network for pose estimation is trained by enhancing the target joint response in the target joint heatmap while ensuring that all joints in the multi-joint heatmap are in an activated state. The apparatus for pose estimation can thus be trained end to end.
According to one aspect of the invention provided in this specification, an electronic device for pose estimation is provided, comprising: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to perform any of the pose estimation methods described above.
According to an aspect of the invention provided in this specification, there is also provided a computer-readable storage medium including a stored program which, when executed, performs any one of the pose estimation methods described above.
The above pose estimation method, apparatus for pose estimation, electronic device for pose estimation, and program stored on a computer-readable storage medium implement a new strategy for pose estimation. Specifically, the strategy identifies all joints in the area image from the extracted visual features and establishes a candidate joint set containing both the target joints and the interfering joints; it then evaluates all joints in the candidate joint set and obtains the target joint information of the target pedestrian instance in the area image. This avoids the problem that a traditional top-down method may assign different pedestrian labels to the joints of the same person, and improves the accuracy of pose estimation in crowded scenes.
The embodiments of the invention provided in the present specification will be further described with reference to the accompanying drawings and detailed description. Additional aspects and advantages of the invention provided by this specification will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention provided by this specification.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to assist in understanding the invention provided herein and, together with the description of the invention provided herein, serve to explain, without limitation, the invention provided herein. In the drawings:
fig. 1 is a schematic flow chart of an embodiment of an attitude estimation method provided in this specification.
Fig. 2 is a block diagram of the artificial neural network corresponding to an embodiment of the pose estimation method provided in this specification.
Fig. 3 is a framework diagram of the artificial neural network corresponding to the multi-joint relation analyzer in an embodiment of the pose estimation method provided in this specification.
Fig. 4 is a framework diagram of the artificial neural network corresponding to the joint refiner in an embodiment of the pose estimation method provided in this specification.
Detailed Description
The invention provided in this specification will be described more clearly and completely with reference to the accompanying drawings. The person skilled in the art will be able to carry out the invention provided in this description on the basis of these descriptions. Before the invention provided in this specification is explained with reference to the drawings, it should be particularly pointed out that:
in the invention provided in the present specification, the technical solutions and the technical features provided in the respective portions including the following description may be combined with each other without conflict.
In addition, the embodiments referred to in the following description are generally only a part of the embodiments of the invention provided in this specification rather than all of them; accordingly, all other embodiments obtained by those skilled in the art on the basis of these embodiments without inventive effort shall fall within the scope of protection of the invention provided in this specification.
With respect to the terms and units in the invention provided in this specification: the terms "comprising", "including", "having" and any variations thereof in the description, the claims and the related parts are intended to cover non-exclusive inclusions. Other related terms and units can be reasonably interpreted on the basis of the related content of the invention provided in this specification.
Fig. 1 is a schematic flow chart of an embodiment of the pose estimation method provided in this specification. As shown in fig. 1, the pose estimation method includes: step S001, extracting visual features from an area image defined by a pedestrian detection frame; step S002, identifying all joints in the area image according to the visual features and establishing a candidate joint set; step S003, evaluating all joints in the candidate joint set to obtain target joint information of the target pedestrian instance in the area image; and step S004, generating a target joint estimation result according to the target joint information so as to generate an estimated pose corresponding to the target pedestrian instance.
The above pose estimation method is implemented by means of an apparatus for pose estimation configured as an artificial neural network, comprising: a visual feature extraction module for extracting visual features from the area image defined by the pedestrian detection frame; a candidate joint identification module for identifying all joints in the area image according to the visual features and establishing a candidate joint set; a target joint information generation module for evaluating all joints in the candidate joint set and obtaining target joint information of the target pedestrian instance in the area image; and an estimated pose generation module for generating a target joint estimation result according to the target joint information so as to generate an estimated pose corresponding to the target pedestrian instance.
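To make the modular structure concrete, the following is a minimal sketch of how the four modules might be wired together, written here in PyTorch. The class name, constructor arguments and method names are illustrative assumptions, not the patent's reference implementation.

```python
import torch
import torch.nn as nn

class PoseEstimationNetwork(nn.Module):
    """Sketch of the four-module apparatus: visual feature extraction,
    candidate joint identification, target joint information generation,
    and estimated pose generation (all names are illustrative)."""

    def __init__(self, encoder, multi_joint_head, relation_analyzer, joint_refiner):
        super().__init__()
        self.encoder = encoder                      # visual feature extraction module
        self.multi_joint_head = multi_joint_head    # candidate joint identification module
        self.relation_analyzer = relation_analyzer  # target joint information generation module
        self.joint_refiner = joint_refiner          # estimated pose generation module

    def forward(self, region_image):
        f_i = self.encoder(region_image)              # visual features f_I
        heatmaps = self.multi_joint_head(f_i)         # multi-joint heatmap H (all joints)
        target_info = self.relation_analyzer(f_i, heatmaps)  # evaluate candidates, drop interference
        return self.joint_refiner(f_i, target_info)   # target joint estimation -> estimated pose
```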
Fig. 2 is a block diagram of the artificial neural network corresponding to an embodiment of the pose estimation method provided in this specification. Fig. 3 is a framework diagram of the artificial neural network corresponding to the multi-joint relation analyzer in this embodiment, and fig. 4 is a framework diagram corresponding to the joint refiner. The above pose estimation method and apparatus for pose estimation will now be further described with reference to figs. 2-4.
First, extraction of visual features from the pedestrian detection frame and multi-joint prediction
The pedestrian detection frame is obtained by a pedestrian detector. The pedestrian detector is a conventional technology that detects all pedestrian instances in an input image and assigns a pedestrian detection frame to each one; the area image defined by each pedestrian detection frame mainly shows the corresponding target pedestrian instance. In the above pose estimation method, the artificial neural network for pose estimation receives the pedestrian detection frame as its starting point and finally generates the estimated pose of the target pedestrian instance corresponding to that frame.
In this embodiment, the step of extracting visual features from the area image defined by the pedestrian detection frame may be implemented by a visual encoder 101 based on convolutional neural networks (CNNs). The visual encoder 101 belongs to the visual feature extraction module.
Preferably, the visual features $f_I \in \mathbb{R}^{H \times W \times D}$ are extracted from the area image I defined by the pedestrian detection frame, where H is the height of the area image, W is its width, and D is the dimension of the visual feature at each pixel of the area image. The visual encoder 101 is preferably an HRNet encoder (for a detailed description of the HRNet encoder, see Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. 2019. Deep High-Resolution Representation Learning for Human Pose Estimation. In CVPR. 5693-5703).
Previously, the visual features obtained by an HRNet encoder were used to predict the target joints directly; in the present embodiment, they are used to predict all joints in the area image defined by the pedestrian detection frame, that is, both the target joints and the interfering joints.
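As a rough sketch of this step, the multi-joint prediction can be read as a pixel-level classification head on top of the encoder features. The channel count and the sigmoid output below are assumptions made for illustration; only the idea of predicting one heatmap per joint class, covering target and interfering joints alike, comes from the description above.

```python
import torch
import torch.nn as nn

class MultiJointEstimator(nn.Module):
    """Pixel-level multi-joint estimator psi_m: predicts one heatmap per
    joint class (target and interfering joints alike) from f_I."""

    def __init__(self, feature_dim=48, num_classes=14):  # 14 classes as in CrowdPose
        super().__init__()
        # W_m corresponds to the parameters of this 1x1 convolution head
        self.head = nn.Conv2d(feature_dim, num_classes, kernel_size=1)

    def forward(self, f_i):
        # f_i: (B, D, H, W) visual features from the HRNet-style encoder
        return self.head(f_i).sigmoid()  # (B, C, H, W) multi-joint heatmap H
```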
Second, identification and evaluation of candidate joints via the multi-joint relation analyzer
After predicting all joints in the pedestrian detection frame with the visual encoder 101, this embodiment provides a multi-joint relation analyzer that establishes a relation graph over the candidate joint set, thereby addressing the interference problem. Specifically, the proposed multi-joint relation analyzer comprises two main parts: (1) a relation encoder; (2) interference elimination. The relation encoder and interference elimination are described in detail below with reference to fig. 3.
(1) Relation encoder
Given a pedestrian detection frame and its area image, the present embodiment first estimates the multi-joint heatmap $H = \psi_m(f_I, W_m)$ from the visual features $f_I$, where $\psi_m$ is a pixel-level multi-joint estimator with parameters $W_m$. Next, a set of $N_p$ candidate joints

$P = \{P_i\}_{i=1}^{N_p}$

is generated from the multi-joint heatmap H according to a set threshold, where $P_i$ denotes a candidate joint. Generating the multi-joint heatmap and the candidate joints corresponds to identifying all joints in the area image according to the visual features and establishing the candidate joint set.
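The thresholded candidate generation might look like the following sketch, which collects local maxima above a set threshold as candidate joints. The threshold value and the 3x3 peak test are assumptions; the description above only fixes that candidates are read off the multi-joint heatmap by thresholding.

```python
import numpy as np

def extract_candidate_joints(heatmaps, threshold=0.3):
    """Build the candidate joint set P from the multi-joint heatmap H.

    heatmaps: (C, H, W) array; returns a list of (class_k, x, y, score)
    for every local maximum whose activation exceeds the threshold."""
    candidates = []
    num_classes, height, width = heatmaps.shape
    for k in range(num_classes):
        h = heatmaps[k]
        for y in range(1, height - 1):
            for x in range(1, width - 1):
                patch = h[y - 1:y + 2, x - 1:x + 2]
                # keep strict local maxima above the set threshold
                if h[y, x] >= threshold and h[y, x] == patch.max():
                    candidates.append((k, x, y, float(h[y, x])))
    return candidates
```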
For each candidate joint $P_i$, three types of features are given:

$P_i = (b_i, c_i, v_i)$

where $b_i = (\Delta x_i, \Delta y_i, x_i, y_i)$ is the position information of the candidate joint and $c_i$ is its category information, i.e., the one-hot characterization over the C joint classes (for example, CrowdPose defines 14 joint classes). In detail, $(x_i, y_i)$ are the coordinates of the candidate joint and $(\Delta x_i, \Delta y_i)$ is its offset from the center point of the body. In addition, $v_i$ is the visual information of the candidate joint, i.e., the joint visual representation at pixel position $(x_i, y_i)$ in the visual features $f_I$ of the pedestrian detection frame area image. Joint-pair relation coding is then performed on the candidate joint set, comprising: 1) geometric coding $b_e \in \mathbb{R}^{N_p \times N_p \times d_b}$; 2) class coding $c_e \in \mathbb{R}^{N_p \times N_p \times d_c}$; 3) visual coding $v_e \in \mathbb{R}^{N_p \times N_p \times d_v}$. Here $b_e$ and $c_e$ encode the geometric associations and the class semantics of the joint pairs respectively, $v_e$ is the coded modeling of the visual representation relationship of the joint pairs, $N_p$ is the size of the candidate joint set, and $d_b$, $d_c$ and $d_v$ are the feature dimensions after the relational encoding.
Since the geometric information of the joints provides the relative relationship between the human joints and the body center, the present embodiment uses such relative geometric information for geometric encoding. For a candidate joint pair (i, j), the geometric relationship encoding can be calculated as

$b_e(i, j) = W_{be}\,[\,b_i,\ b_j\,]$

where $W_{be}$ is the transformation parameter that maps the original geometric information to a $d_b$-dimensional feature vector, and $b_i$ and $b_j$ are the joint position information of candidate joints i and j.
To extract relation codes from the category information and the visual representation, the present embodiment generates the class relation code and the visual relation code for a candidate joint pair (i, j) as

$c_e(i, j) = W_{ce}^{T}\,\sigma\big((V c_i) \odot (U c_j)\big)$

$v_e(i, j) = W_{ve}^{T}\,\sigma\big((V v_i) \odot (U v_j)\big)$

where V and U are linear matrices that project the input to feature vectors, T denotes matrix transposition, $\sigma$ is the nonlinear ReLU activation function, and $\odot$ denotes element-by-element multiplication between matrices. $W_{ce}$ and $W_{ve}$ map the relational features to the $d_c$- and $d_v$-dimensional feature vectors respectively. Furthermore, $c_i$ and $c_j$ are the category information and $v_i$ and $v_j$ the visual information of candidate joints i and j.
Next, the relation codes of all candidate joint pairs (the geometric relation code $b_e$, the class relation code $c_e$ and the visual relation code $v_e$) are concatenated to obtain the relationship features

$E_r \in \mathbb{R}^{N_p \times N_p \times (d_b + d_c + d_v)}$

Then, a sigmoid-activated linear function $\psi_a$ yields the joint relation graph between candidate joint pairs

$A = \psi_a(E_r, W_r) \in \mathbb{R}^{N_p \times N_p}$

where $W_r \in \mathbb{R}^{(d_b + d_c + d_v) \times 1}$ is a linear transformation parameter and $\psi_a$ is the joint-pair relation estimator with parameters $W_r$. Each element $A_{i,j}$ of the joint relation graph indicates the likelihood that candidate joint i and candidate joint j belong to the same pedestrian instance.
(2) Interference rejection
Given the candidate joint set P, the goal is to eliminate the interfering joints. Assume there are C multi-joint heatmaps for the C joint classes. For the k-th joint class there are $N_k$ candidate joints, originally scored by the heatmap $H_k$; using the joint relation graph A, they are re-scored as follows:

$S_v(i) = \frac{1}{N_k - 1} \sum_{j=1}^{N_k} \mathbb{I}[i \neq j]\, A(i, j)$

$S(i) = S_v(i) + H_k(i)$

where $S_v(i)$ is the average relation score of the i-th candidate joint, computed by traversing the other $N_k - 1$ candidates of class k; $\mathbb{I}[i \neq j]$ is an indicator function whose output is 1 if and only if candidate joints i and j are different; and $A(i, j)$ is the element of the joint relation graph A for candidate joints i and j. Furthermore, $H_k(i)$ is the potential obtained from the k-th multi-joint heatmap, and $S(i)$ is the final score of the i-th candidate joint. During forward estimation, the joint with the highest score is set as the target joint of the k-th joint class. For simplicity, the set of target joints is denoted $P_t$. Finally, the target joint attention map $A_t$ is generated from the target joints by applying a Gaussian-based heatmap generation method.
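The re-scoring rule transcribes almost directly into code; the sketch below assumes the heatmap potentials and the class-k slice of the relation graph have already been gathered per candidate.

```python
import numpy as np

def select_target_joint(scores_hk, relation_graph):
    """Re-score the N_k candidates of one joint class and keep the best.

    scores_hk:      (N_k,) heatmap potentials H_k(i) of the candidates
    relation_graph: (N_k, N_k) slice of A restricted to these candidates
    Returns the index of the candidate chosen as the target joint."""
    n_k = len(scores_hk)
    if n_k == 1:
        return 0
    # S_v(i): average relation score against the other N_k - 1 candidates
    off_diagonal = relation_graph.sum(axis=1) - np.diag(relation_graph)
    s_v = off_diagonal / (n_k - 1)
    s = s_v + scores_hk          # S(i) = S_v(i) + H_k(i)
    return int(np.argmax(s))     # highest score becomes the class-k target joint
```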
Third, target joint information generation and utilization
A pose estimator may fail when faced with blurred joints caused by crowded scenes, such as joints that are invisible due to severe occlusion, or joints with highly similar visual appearance. To address this problem, the present embodiment introduces a joint refiner to establish a pose correction mechanism. Specifically, the joint refiner corrects the visual features and the multi-joint estimator separately, in two steps: (1) pose feature correction; (2) pose estimator correction. These two steps are described in detail below with reference to fig. 4.
The aim of pose feature correction is to make the pose estimation model focus more on the most relevant visual feature regions, which can be achieved with the target joint attention map generated by the multi-joint relation analyzer. Specifically, the present embodiment first concatenates the visual features $f_I$ of the pedestrian detection frame area image with the corresponding target joint attention map $A_t$. Then, a correction network with successive convolution layers extracts the most relevant visual signals from the visual features under the guidance of the attention map, obtaining the corrected visual features $f_r$.
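The correction network can be sketched as a concatenation followed by successive convolutions, as described above; the depth, kernel sizes and channel counts below are assumptions.

```python
import torch
import torch.nn as nn

class PoseFeatureCorrection(nn.Module):
    """Corrects f_I under the guidance of the target joint attention map A_t."""

    def __init__(self, feature_dim=48, attention_channels=14):
        super().__init__()
        # successive convolution layers over the concatenated pair [f_I, A_t]
        self.correct = nn.Sequential(
            nn.Conv2d(feature_dim + attention_channels, feature_dim, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feature_dim, feature_dim, 3, padding=1),
        )

    def forward(self, f_i, a_t):
        # f_i: (B, D, H, W) visual features; a_t: (B, C, H, W) attention map
        return self.correct(torch.cat([f_i, a_t], dim=1))  # corrected features f_r
```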
After feature correction, the present embodiment requires a robust pose estimator for target joint estimation. In general, the position of a target joint can be inferred not only from visual signals but also from common sense. In light of this, the present embodiment constructs a target joint estimator by migrating joint analytic knowledge from the multi-joint estimator through common sense knowledge modeling. Specifically, the pose estimator correction comprises two main modules: 1) knowledge graph generation; 2) knowledge migration.
1) Knowledge graph generation
As previously mentioned, relevant prior knowledge helps to better estimate the positions of body joints. For example, one can easily infer the location of the "neck" after knowing where the "head" is, because there is a strong relationship between these two joint classes. To model such common sense relationships between pairs of joint classes, the present embodiment establishes a knowledge graph that provides information on the relationships between joint classes. The knowledge graph generation mechanism designed in this application comprises a semantic relationship $K_s \in \mathbb{R}^{C \times C}$ and a common sense relationship $K_c \in \mathbb{R}^{C \times C}$, where C is the number of joint classes (for example, CrowdPose defines 14 joint classes). The semantic relationship is a "soft connection" between joint classes (values between 0 and 1) obtained naturally from a language model, while the common sense relationship is a true "hard connection" between joint classes (values are either 0 or 1).
To construct the semantic relationship, the present embodiment first extracts the semantic embedding vectors of the joints from a language model and then calculates their similarity scores. Given the i-th and the j-th joint classes, their semantic similarity score is calculated as

$K_s(i, j) = \frac{W2V(c_i)^{T}\, W2V(c_j)}{\lVert W2V(c_i) \rVert \cdot \lVert W2V(c_j) \rVert}$

where W2V(·) is the Word2Vec function that extracts the semantic embedding vector of a joint from the language model, $\lVert \cdot \rVert$ denotes the Euclidean norm of a feature vector, and T denotes matrix transposition. It can be observed from the obtained semantic relationship that if two joint classes are connected, their similarity score is relatively high.
To construct the common sense relationship representing the natural connections between joints, this embodiment introduces an identification function I(s, k) whose output is 1 if and only if there is a "connection" relationship between the two joint classes (for example, it is set to 1 for the neck and the head, which are connected):

$K_c(s, k) = I(s, k) = \begin{cases} 1, & \text{if joint classes } s \text{ and } k \text{ are connected} \\ 0, & \text{otherwise} \end{cases}$
therefore, the common sense knowledge map K finally about human body architectureg∈RC×CThe calculation is as follows:
Figure RE-GDA0002646554480000091
wherein,
Figure RE-GDA0002646554480000092
table element by element multiplication. D-dimensional parameters W for a given multi-joint estimatorm∈RC×DAnd the common sense knowledge map K constructed as aboveg∈RC×CIn this embodiment, the matrix multiplier may be used to obtain the related resolution parameter Wk=Kg TWm,→RC×D
2) Knowledge migration

Having obtained the related analytic parameters $W_k$, the present embodiment then converts them into the parameters of the target joint estimator. To this end, a small network consisting of two linear transformations with a LeakyReLU activation between them is constructed to obtain the target joint estimator parameters $W_t \in \mathbb{R}^{C \times D}$, calculated as

$W_t = \Phi(W_k W_t^{1})\, W_t^{2}$

where $W_t^{1} \in \mathbb{R}^{D \times D}$ and $W_t^{2} \in \mathbb{R}^{D \times D}$ are two linear transformation matrices and $\Phi(\cdot)$ is the nonlinear LeakyReLU activation function.
Thus, with the parameters $W_t$, the target joint heatmap O can be estimated more reasonably from the corrected visual features $f_r$. This is because, unlike conventional methods that perform target pedestrian pose estimation with the parameters of the target joint alone, the present application also exploits the analytic parameters of the joints connected to the target joint, yielding a more accurate pose estimate. For example, if the "head" of a pedestrian is occluded while the "neck" is visible, the analytic parameters of the neck can be used to make a more reasonable inference about the head.
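The migration network is small enough to sketch directly. Applying $W_t$ to the corrected features as a 1x1 convolution is one plausible reading of the pixel-level target joint estimator and is stated here as an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KnowledgeMigration(nn.Module):
    """Converts the analytic parameters W_k into target joint estimator
    parameters W_t = Phi(W_k W_t1) W_t2, then estimates the heatmap O."""

    def __init__(self, dim):
        super().__init__()
        self.w_t1 = nn.Parameter(torch.randn(dim, dim) * 0.01)  # W_t^1
        self.w_t2 = nn.Parameter(torch.randn(dim, dim) * 0.01)  # W_t^2
        self.phi = nn.LeakyReLU(0.1)                            # Phi

    def forward(self, w_k, f_r):
        # w_k: (C, D) analytic parameters; f_r: (B, D, H, W) corrected features
        w_t = self.phi(w_k @ self.w_t1) @ self.w_t2             # (C, D)
        # use W_t as a 1x1 convolution kernel over f_r to get heatmap O
        kernel = w_t.unsqueeze(-1).unsqueeze(-1)                # (C, D, 1, 1)
        return torch.sigmoid(F.conv2d(f_r, kernel))             # (B, C, H, W)
```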
Fourth, model training
The pose estimation network model based on relational modeling has been explained in detail above. To enable the proposed model to better estimate target joints in crowded scenes, this embodiment designs corresponding learning targets to train the model parameters. Given a pedestrian detection frame, inputting its area image into the model yields three types of maps: 1) the target joint heatmap O; 2) the multi-joint heatmap H; 3) the joint relation graph A. Specifically: first, the area image of the pedestrian detection frame is input into the pose estimation model to extract the corresponding visual features. Then, all joints in the area image are identified from the visual features, i.e., a multi-joint heatmap H covering both the interfering joints and the target joints is generated, ensuring that all joints are in an activated state. Next, a candidate joint set is established from the multi-joint heatmap and all joints are evaluated, i.e., the joint relation graph A is generated. Finally, the target joint information of the target pedestrian instance in the area image is obtained from the joint relation graph, and the target joint estimation result is generated from the target joint information to produce the corresponding target joint heatmap O. The goal is to enhance the target joint response in the target joint heatmap O while ensuring that all joints in the multi-joint heatmap H are activated. To achieve this learning goal, supervised learning is performed on the multi-joint heatmap and the target joint heatmap using the mean square error (MSE), with loss functions defined as follows:
$l_t = \frac{1}{C} \sum_{k=1}^{C} \big\lVert O_k - \bar{O}_k \big\rVert_2^2$

$l_j = \frac{1}{C} \sum_{k=1}^{C} \big\lVert H_k - \bar{H}_k \big\rVert_2^2$

where $\bar{O}$ and $\bar{H}$ are the ground-truth target joint heatmap and multi-joint heatmap for the C joint classes, respectively. In detail, $\bar{O}$ contains only a unimodal Gaussian distribution for each target joint and can be obtained directly from the annotation data of the data set. $\bar{H}$, however, contains a multimodal Gaussian distribution covering both the target joints and the interfering joints, and the ground truth of the multi-joint heatmap is not given directly in the data set. To obtain it for a given pedestrian detection frame, the labeled target joints inside the frame are first collected; the other pedestrian detection frames are then traversed, and any of their target joints contained in the frame are labeled as interfering joints of the frame.
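The traversal just described can be sketched as follows; testing containment on the joint coordinates and deferring the Gaussian rendering to a separate step are assumptions.

```python
def collect_multi_joint_ground_truth(boxes, joints_per_box):
    """For each detection box, gather its own labeled target joints plus the
    target joints of other boxes that fall inside it (interfering joints).

    boxes:          list of (x0, y0, x1, y1)
    joints_per_box: list of [(class_k, x, y), ...] labeled target joints
    Returns, per box, the joint list to render as a multimodal Gaussian heatmap."""
    def inside(joint, box):
        _, x, y = joint
        x0, y0, x1, y1 = box
        return x0 <= x <= x1 and y0 <= y <= y1

    ground_truth = []
    for i, box in enumerate(boxes):
        joints = list(joints_per_box[i])            # target joints of this box
        for j, other_joints in enumerate(joints_per_box):
            if j == i:
                continue
            # other boxes' target joints inside this box become interference
            joints.extend(jt for jt in other_joints if inside(jt, box))
        ground_truth.append(joints)
    return ground_truth
```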
In addition, to obtain better target joint information and thus a better estimate of the target joints, the present embodiment also performs supervised learning on the joint relation graph, adopting the same strategy of computing the mean square error between the joint relation graph and its ground truth:

$l_r = \big\lVert A - \bar{A} \big\rVert_2^2$

where $\bar{A}$ is the ground truth of the joint relation graph, of size $N_p \times N_p$ ($N_p$ being the number of all candidate joints in the pedestrian detection frame, i.e., including the target joints and the interfering joints). Each element $\bar{A}_{i,j}$ is 0 or 1, taking the value 1 if and only if the i-th and the j-th joints belong to the same pedestrian instance. Thus, the learning objective of the entire model is calculated as

$L = \alpha l_t + \beta l_j + \theta l_r$

where α, β and θ are the weight hyperparameters of the corresponding learning targets, all set to 1 in the present embodiment.
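The combined objective transcribes directly from the formulas above; using the mean-reduction MSE is an assumption about the normalization.

```python
import torch
import torch.nn.functional as F

def total_loss(o, o_gt, h, h_gt, a, a_gt, alpha=1.0, beta=1.0, theta=1.0):
    """L = alpha*l_t + beta*l_j + theta*l_r, with all weights set to 1
    in the embodiment.

    o, o_gt: (B, C, H, W) predicted / ground-truth target joint heatmaps
    h, h_gt: (B, C, H, W) predicted / ground-truth multi-joint heatmaps
    a, a_gt: (B, N, N) predicted / ground-truth joint relation graphs"""
    l_t = F.mse_loss(o, o_gt)  # target joint heatmap supervision
    l_j = F.mse_loss(h, h_gt)  # multi-joint heatmap supervision
    l_r = F.mse_loss(a, a_gt)  # joint relation graph supervision
    return alpha * l_t + beta * l_j + theta * l_r
```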
Fifth, effects and experiments
Compared with the prior art, the contributions of this application are:
1) This embodiment introduces a novel strategy to handle the multiple joints in a pedestrian detection box, including the target joints and the interfering joints. It is the first attempt to model the relationships of all joints in one pedestrian detection box in order to eliminate the interfering joints. This new strategy greatly alleviates the confusion that arises in a pose estimation model when the same joint carries completely different pedestrian labels.
2) Inspired by the fact that humans can estimate the positions of blurred joints well by looking at the surrounding area, this embodiment introduces a pose correction method with common sense modeling capability, i.e., the pose estimation result is improved by correcting both the pose estimator and the visual pose features.
3) Extensive experimental results show that the pose estimation network model based on relational modeling outperforms the latest pose estimation methods, especially on the challenging CrowdPose data set.
The contents of the present invention have been explained above. Those skilled in the art will be able to implement the invention based on these teachings. Based on the above disclosure of the present invention, all other preferred embodiments and examples obtained by a person skilled in the art without any inventive step should fall within the scope of protection of the present invention.

Claims (10)

1. A pose estimation method, characterized by comprising:
extracting visual features from an area image defined by a pedestrian detection frame;
identifying all joints in the area image according to the visual features and establishing a candidate joint set;
evaluating all joints in the candidate joint set and obtaining target joint information of a target pedestrian instance in the area image; and
generating a target joint estimation result according to the target joint information so as to generate an estimated pose corresponding to the target pedestrian instance.
2. The pose estimation method according to claim 1, characterized in that the process of generating the target joint estimation result according to the target joint information is implemented by a target joint estimator modeled with common human knowledge.
3. The pose estimation method according to claim 1, characterized in that: the target joint information includes a corrected feature obtained by correcting the visual features through an attention mechanism which takes as its object of attention the target joints obtained by excluding the interfering joints from the candidate joint set after the evaluation.
4. The pose estimation method according to claim 1, characterized in that: the process of evaluating all joints in the candidate joint set and obtaining the target joint information of the target pedestrian instance in the area image comprises modeling the relationships of all joints in the candidate joint set and, on that basis, removing the interfering joints to obtain the target joint information of the target pedestrian instance.
5. An apparatus for pose estimation, configured as an artificial neural network, comprising:
a visual feature extraction module for extracting visual features from an area image defined by a pedestrian detection frame;
a candidate joint identification module for identifying all joints in the area image according to the visual features and establishing a candidate joint set;
a target joint information generation module for evaluating all joints in the candidate joint set and obtaining target joint information of a target pedestrian instance in the area image; and
an estimated pose generation module for generating a target joint estimation result according to the target joint information so as to generate an estimated pose corresponding to the target pedestrian instance.
6. The apparatus for pose estimation according to claim 5, wherein: the estimated pose generation module includes a target joint estimator modeled with human common sense.
7. The apparatus for pose estimation according to claim 5, wherein: the target joint information generation module includes a visual feature correction means that obtains a corrected feature by correcting the visual features through an attention mechanism which takes as its object of attention the target joints obtained by excluding the interfering joints from the candidate joint set after the evaluation.
8. The apparatus for pose estimation according to claim 5, wherein:
the candidate joint identification module comprises a multi-joint heatmap generation means for generating a heatmap of all joints from the extracted visual features;
the estimated pose generation module comprises a target joint heatmap generation means for generating a target joint heatmap according to the target joint estimation result;
and the training of the artificial neural network for pose estimation is performed by enhancing the target joint response in the target joint heatmap while ensuring that all joints in the multi-joint heatmap are in an activated state.
9. An electronic device for pose estimation, characterized by comprising:
a processor;
a memory for storing processor-executable instructions;
the processor is configured to perform the pose estimation method of any of claims 1-4.
10. A computer-readable storage medium, characterized by comprising a stored program which, when executed, performs the pose estimation method of any one of claims 1-4.