CN111898566A - Attitude estimation method, attitude estimation device, electronic equipment and storage medium

Publication number: CN111898566A (application CN202010771698.2A; granted as CN111898566B)
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: joint, target, joints, candidate, target joint
Inventors: 高联丽, 代燕, 王轩瀚, 宋井宽
Assignees: Chengdu Jingzhili Technology Co., Ltd.; University of Electronic Science and Technology of China
Application filed 2020-08-04 by Chengdu Jingzhili Technology Co., Ltd. and University of Electronic Science and Technology of China
Published as CN111898566A on 2020-11-06; granted as CN111898566B on 2023-02-03
Legal status: Granted; Active

Classifications

    • G06V40/103: Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G06F18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N3/02: Neural networks
    • G06T7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06V40/23: Recognition of whole body movements, e.g. for sport training


Abstract

The invention discloses a pose estimation method, a pose estimation device, electronic equipment and a storage medium, aiming to improve the accuracy of pose estimation in crowded scenes. The method comprises the following steps: extracting visual features from an area image defined by a pedestrian detection frame; identifying all joints in the area image according to the visual features and establishing a candidate joint set; evaluating all joints in the candidate joint set and obtaining target joint information of the target pedestrian instance in the area image; and generating a target joint estimation result according to the target joint information so as to generate an estimated pose corresponding to the target pedestrian instance. Because all joints in the area image are identified from the extracted visual features and gathered into a candidate joint set that contains both the target joints and the interfering joints, and all joints in the candidate joint set are then evaluated to obtain the target joint information of the target pedestrian instance, the accuracy of pose estimation in crowded scenes is improved.

Description

Attitude estimation method, attitude estimation device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of image processing, and in particular to a pose estimation method, a pose estimation device, electronic equipment and a storage medium.
Background
Human pose estimation is a fundamental and challenging problem in computer vision that aims to accurately localize multiple human bodies and the sparse joint positions on their skeletons from a single RGB image. With the application of deep convolutional neural networks (CNNs) and the release of large-scale data sets such as MSCOCO, pose estimation methods have developed rapidly. They can be roughly divided into bottom-up and top-down methods. Bottom-up methods first detect all human joints and then group them into different human instances; the problem mostly centers on how to group candidate joints into individual human instances. Top-down methods take the opposite approach: all human body instances are located first, and pose estimation is then performed for each pedestrian; these methods mainly focus on how to design a more efficient single-person pose estimator (SPPE). Compared with bottom-up methods, which need not detect human body instances, top-down methods generally achieve better pose estimation performance but lower inference speed.
Although existing top-down pose estimation methods perform well in simple scenes, they still face great challenges in crowded scenes. By "crowded scene" is meant an RGB image capturing a complex real-world scene with highly overlapping pedestrians, severe occlusions, diverse poses, and multi-scale variations. For crowded scenes, existing top-down pose estimation methods suffer from the following two technical problems:
1) The pedestrian detection frame contains joints of multiple people. Current top-down methods assume that each detected pedestrian instance contains only the joints belonging to the target pedestrian, i.e., the target joints. However, crowded scenes typically contain highly occluded or overlapping pedestrians, which means that a generated pedestrian detection frame contains, in addition to the target joints, joints belonging to other pedestrian instances, i.e., interfering joints. Under the above assumption, a conventional top-down method may assign different pedestrian labels to the joints of the same person, and once an interfering joint is taken to be a target joint the error is irreversible. In addition, these interfering joints are very likely to be the target joints of other pedestrians, so they cannot be suppressed excessively while the target joint response is enhanced. Since interfering joints can greatly confuse the prediction of the target joints, eliminating them from a given pedestrian detection frame is a very challenging technical problem.
2) Blurred joints in crowded scenes. Traditional top-down pose estimation methods depend heavily on the extraction of visual pose features, yet the visual features extracted from the area image capture only visual appearance and lack prior knowledge of the human body structure. A pose estimator may therefore fail when faced with blurred joints caused by crowded scenes, such as joints that are invisible due to severe occlusion, or joints with highly similar visual appearance. Humans, however, can estimate such blurred joints well by looking at the surrounding area. For example, relying on common sense reasoning, one can easily infer the location of the "neck" after seeing the "head" and "shoulders". Therefore, another key technical problem is how to embed the modeling capability of common sense knowledge into current pose estimation methods.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: provide a pose estimation method, a pose estimation device, an electronic device and a storage medium that improve the accuracy of pose estimation in crowded scenes.
The technical solution adopted by the invention to solve the technical problem is as follows: a pose estimation method comprising: extracting visual features from an area image defined by a pedestrian detection frame; identifying all joints in the area image according to the visual features and establishing a candidate joint set; evaluating all joints in the candidate joint set and obtaining target joint information of the target pedestrian instance in the area image; and generating a target joint estimation result according to the target joint information so as to generate an estimated pose corresponding to the target pedestrian instance.
In the above pose estimation method, all joints in the area image are identified from the extracted visual features and a candidate joint set is established; at this point the candidate joint set contains both the target joints and the interfering joints. All joints in the candidate joint set are then evaluated to obtain the target joint information of the target pedestrian instance in the area image. This avoids the problem that a traditional top-down method may assign different pedestrian labels to the joints of the same person, and thereby improves the accuracy of pose estimation in crowded scenes.
According to embodiments of the invention provided in this specification, the process of generating the target joint estimation result from the target joint information is implemented by a target joint estimator modeled with common human knowledge. By introducing a target joint estimator modeled with common human knowledge, target joint estimation can draw on the reasoning capability of human common sense, further improving the accuracy of pose estimation in crowded scenes.
According to an embodiment of the invention provided in this specification, the target joint information includes a corrected feature obtained by correcting the visual features through an attention mechanism whose object of attention is the target joints obtained by excluding the interfering joints from the candidate joint set after the evaluation.
According to an embodiment of the invention provided in this specification, the process of evaluating all joints in the candidate joint set and obtaining the target joint information of the target pedestrian instance in the area image includes modeling the relationships of all joints in the candidate joint set and, on that basis, removing the interfering joints to obtain the target joint information of the target pedestrian instance.
According to one aspect of the invention provided in this specification, there is provided an apparatus for pose estimation. The apparatus is configured as an artificial neural network, comprising: a visual feature extraction module for extracting visual features from the area image defined by the pedestrian detection frame; a candidate joint identification module for identifying all joints in the area image according to the visual features and establishing a candidate joint set; a target joint information generation module for evaluating all joints in the candidate joint set and obtaining target joint information of the target pedestrian instance in the area image; and an estimated pose generation module for generating a target joint estimation result according to the target joint information so as to generate an estimated pose corresponding to the target pedestrian instance.
With the above apparatus for pose estimation, all joints in the area image are identified from the extracted visual features and a candidate joint set is established; at this point the candidate joint set contains both the target joints and the interfering joints. All joints in the candidate joint set are then evaluated to obtain the target joint information of the target pedestrian instance in the area image. This avoids the problem that a traditional top-down method may assign different pedestrian labels to the joints of the same person, and thereby improves the accuracy of pose estimation in crowded scenes.
According to an embodiment of the invention provided in this specification, the estimated pose generation module comprises a target joint estimator modeled with human common sense. Likewise, by introducing a target joint estimator modeled with common human knowledge, the reasoning capability of human common sense can help complete target joint estimation, further improving the accuracy of pose estimation in crowded scenes.
According to an embodiment of the invention provided in this specification, the target joint information generation module includes a visual feature correction means that obtains a corrected feature by correcting the visual features through an attention mechanism whose object of attention is the target joints obtained by excluding the interfering joints from the candidate joint set after the evaluation.
According to an embodiment of the invention provided in this specification, the candidate joint identification module comprises a multi-joint heatmap generation means for generating a heatmap of all joints from the extracted visual features; the estimated pose generation module comprises a target joint heatmap generation means for generating a target joint heatmap from the target joint estimation result; and the artificial neural network for pose estimation is trained by enhancing the target joint response in the target joint heatmap while ensuring that all joints in the multi-joint heatmap are in an activated state. The apparatus for pose estimation can thus be trained end to end.
According to one aspect of the invention provided in this specification, an electronic device for pose estimation is provided, comprising: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to perform any of the pose estimation methods described above.
According to an aspect of the invention provided in this specification, there is also provided a computer-readable storage medium including a stored program which, when executed, performs any one of the pose estimation methods described above.
The above pose estimation method, apparatus for pose estimation, electronic device for pose estimation, and program stored on a computer-readable storage medium implement a new strategy for pose estimation. Specifically, the strategy identifies all joints in the area image from the extracted visual features and establishes a candidate joint set containing both the target joints and the interfering joints; it then evaluates all joints in the candidate joint set and obtains the target joint information of the target pedestrian instance in the area image. This avoids the problem that a traditional top-down method may assign different pedestrian labels to the joints of the same person, and improves the accuracy of pose estimation in crowded scenes.
The embodiments of the invention provided in the present specification will be further described with reference to the accompanying drawings and detailed description. Additional aspects and advantages of the invention provided by this specification will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention provided by this specification.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to assist in understanding the invention provided herein and, together with the description of the invention provided herein, serve to explain, without limitation, the invention provided herein. In the drawings:
fig. 1 is a schematic flow chart of an embodiment of an attitude estimation method provided in this specification.
Fig. 2 is a block diagram of the artificial neural network corresponding to an embodiment of the pose estimation method provided in this specification.
Fig. 3 is a framework diagram of the artificial neural network corresponding to the multi-joint relation analyzer in an embodiment of the pose estimation method provided in this specification.
Fig. 4 is a framework diagram of the artificial neural network corresponding to the joint refiner in an embodiment of the pose estimation method provided in this specification.
Detailed Description
The invention provided in this specification will be described more clearly and completely with reference to the accompanying drawings. The person skilled in the art will be able to carry out the invention provided in this description on the basis of these descriptions. Before the invention provided in this specification is explained with reference to the drawings, it should be particularly pointed out that:
in the invention provided in the present specification, the technical solutions and the technical features provided in the respective portions including the following description may be combined with each other without conflict.
In addition, the embodiments referred to in the following description are generally only a part of the embodiments of the invention provided in this specification rather than all of them; accordingly, all other embodiments obtained by those skilled in the art on the basis of these embodiments without inventive effort shall fall within the scope of protection of the invention provided in this specification.
With respect to the terms and units in the invention provided in this specification: the terms "comprising", "including", "having" and any variations thereof in the description, the claims and the related parts are intended to cover non-exclusive inclusions. Other related terms and units can be reasonably interpreted on the basis of the related content of the invention provided in this specification.
Fig. 1 is a schematic flow chart of an embodiment of the pose estimation method provided in this specification. As shown in fig. 1, the pose estimation method includes: step S001, extracting visual features from an area image defined by a pedestrian detection frame; step S002, identifying all joints in the area image according to the visual features and establishing a candidate joint set; step S003, evaluating all joints in the candidate joint set to obtain target joint information of the target pedestrian instance in the area image; and step S004, generating a target joint estimation result according to the target joint information so as to generate an estimated pose corresponding to the target pedestrian instance.
The above pose estimation method is implemented by means of an apparatus for pose estimation configured as an artificial neural network, comprising: a visual feature extraction module for extracting visual features from the area image defined by the pedestrian detection frame; a candidate joint identification module for identifying all joints in the area image according to the visual features and establishing a candidate joint set; a target joint information generation module for evaluating all joints in the candidate joint set and obtaining target joint information of the target pedestrian instance in the area image; and an estimated pose generation module for generating a target joint estimation result according to the target joint information so as to generate an estimated pose corresponding to the target pedestrian instance.
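To make the modular structure concrete, the following is a minimal sketch of how the four modules might be wired together, written here in PyTorch. The class name, constructor arguments and method names are illustrative assumptions, not the patent's reference implementation.

```python
import torch
import torch.nn as nn

class PoseEstimationNetwork(nn.Module):
    """Sketch of the four-module apparatus: visual feature extraction,
    candidate joint identification, target joint information generation,
    and estimated pose generation (all names are illustrative)."""

    def __init__(self, encoder, multi_joint_head, relation_analyzer, joint_refiner):
        super().__init__()
        self.encoder = encoder                      # visual feature extraction module
        self.multi_joint_head = multi_joint_head    # candidate joint identification module
        self.relation_analyzer = relation_analyzer  # target joint information generation module
        self.joint_refiner = joint_refiner          # estimated pose generation module

    def forward(self, region_image):
        f_i = self.encoder(region_image)              # visual features f_I
        heatmaps = self.multi_joint_head(f_i)         # multi-joint heatmap H (all joints)
        target_info = self.relation_analyzer(f_i, heatmaps)  # evaluate candidates, drop interference
        return self.joint_refiner(f_i, target_info)   # target joint estimation -> estimated pose
```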
Fig. 2 is a block diagram of the artificial neural network corresponding to an embodiment of the pose estimation method provided in this specification. Fig. 3 is a framework diagram of the artificial neural network corresponding to the multi-joint relation analyzer in this embodiment, and fig. 4 is a framework diagram corresponding to the joint refiner. The above pose estimation method and apparatus for pose estimation will now be further described with reference to figs. 2-4.
First, extraction of visual features from the pedestrian detection frame and multi-joint prediction
The pedestrian detection frame is obtained by a pedestrian detector. The pedestrian detector is a conventional technology that detects all pedestrian instances in an input image and assigns a pedestrian detection frame to each one; the area image defined by each pedestrian detection frame mainly shows the corresponding target pedestrian instance. In the above pose estimation method, the artificial neural network for pose estimation receives the pedestrian detection frame as its starting point and finally generates the estimated pose of the target pedestrian instance corresponding to that frame.
In this embodiment, the step of extracting visual features from the area image defined by the pedestrian detection frame may be implemented by a visual encoder 101 based on convolutional neural networks (CNNs). The visual encoder 101 belongs to the visual feature extraction module.
Preferably, the visual features $f_I \in \mathbb{R}^{H \times W \times D}$ are extracted from the area image I defined by the pedestrian detection frame, where H is the height of the area image, W is its width, and D is the dimension of the visual feature at each pixel of the area image. The visual encoder 101 is preferably an HRNet encoder (for a detailed description of the HRNet encoder, see Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. 2019. Deep High-Resolution Representation Learning for Human Pose Estimation. In CVPR. 5693-5703).
Previously, the visual features obtained by an HRNet encoder were used to predict the target joints directly; in the present embodiment, they are used to predict all joints in the area image defined by the pedestrian detection frame, that is, both the target joints and the interfering joints.
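As a rough sketch of this step, the multi-joint prediction can be read as a pixel-level classification head on top of the encoder features. The channel count and the sigmoid output below are assumptions made for illustration; only the idea of predicting one heatmap per joint class, covering target and interfering joints alike, comes from the description above.

```python
import torch
import torch.nn as nn

class MultiJointEstimator(nn.Module):
    """Pixel-level multi-joint estimator psi_m: predicts one heatmap per
    joint class (target and interfering joints alike) from f_I."""

    def __init__(self, feature_dim=48, num_classes=14):  # 14 classes as in CrowdPose
        super().__init__()
        # W_m corresponds to the parameters of this 1x1 convolution head
        self.head = nn.Conv2d(feature_dim, num_classes, kernel_size=1)

    def forward(self, f_i):
        # f_i: (B, D, H, W) visual features from the HRNet-style encoder
        return self.head(f_i).sigmoid()  # (B, C, H, W) multi-joint heatmap H
```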
Second, identification and evaluation of candidate joints via the multi-joint relation analyzer
After predicting all joints in the pedestrian detection frame with the visual encoder 101, this embodiment provides a multi-joint relation analyzer that establishes a relation graph over the candidate joint set, thereby addressing the interference problem. Specifically, the proposed multi-joint relation analyzer comprises two main parts: (1) a relation encoder; (2) interference elimination. The relation encoder and interference elimination are described in detail below with reference to fig. 3.
(1) Relation encoder
Given a pedestrian detection frame and its area image, the present embodiment first estimates the multi-joint heatmap $H = \psi_m(f_I, W_m)$ from the visual features $f_I$, where $\psi_m$ is a pixel-level multi-joint estimator with parameters $W_m$. Next, a set of $N_p$ candidate joints

$P = \{P_i\}_{i=1}^{N_p}$

is generated from the multi-joint heatmap H according to a set threshold, where $P_i$ denotes a candidate joint. Generating the multi-joint heatmap and the candidate joints corresponds to identifying all joints in the area image according to the visual features and establishing the candidate joint set.
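The thresholded candidate generation might look like the following sketch, which collects local maxima above a set threshold as candidate joints. The threshold value and the 3x3 peak test are assumptions; the description above only fixes that candidates are read off the multi-joint heatmap by thresholding.

```python
import numpy as np

def extract_candidate_joints(heatmaps, threshold=0.3):
    """Build the candidate joint set P from the multi-joint heatmap H.

    heatmaps: (C, H, W) array; returns a list of (class_k, x, y, score)
    for every local maximum whose activation exceeds the threshold."""
    candidates = []
    num_classes, height, width = heatmaps.shape
    for k in range(num_classes):
        h = heatmaps[k]
        for y in range(1, height - 1):
            for x in range(1, width - 1):
                patch = h[y - 1:y + 2, x - 1:x + 2]
                # keep strict local maxima above the set threshold
                if h[y, x] >= threshold and h[y, x] == patch.max():
                    candidates.append((k, x, y, float(h[y, x])))
    return candidates
```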
For each candidate joint $P_i$, three types of features are given:

$P_i = (b_i, c_i, v_i)$

where $b_i = (\Delta x_i, \Delta y_i, x_i, y_i)$ is the position information of the candidate joint and $c_i$ is its category information, i.e., the one-hot characterization over the C joint classes (for example, CrowdPose defines 14 joint classes). In detail, $(x_i, y_i)$ are the coordinates of the candidate joint and $(\Delta x_i, \Delta y_i)$ is its offset from the center point of the body. In addition, $v_i$ is the visual information of the candidate joint, i.e., the joint visual representation at pixel position $(x_i, y_i)$ in the visual features $f_I$ of the pedestrian detection frame area image. Joint-pair relation coding is then performed on the candidate joint set, comprising: 1) geometric coding $b_e \in \mathbb{R}^{N_p \times N_p \times d_b}$; 2) class coding $c_e \in \mathbb{R}^{N_p \times N_p \times d_c}$; 3) visual coding $v_e \in \mathbb{R}^{N_p \times N_p \times d_v}$. Here $b_e$ and $c_e$ encode the geometric associations and the class semantics of the joint pairs respectively, $v_e$ is the coded modeling of the visual representation relationship of the joint pairs, $N_p$ is the size of the candidate joint set, and $d_b$, $d_c$ and $d_v$ are the feature dimensions after the relational encoding.
Since the geometric information of the joints provides the relative relationship between the human joints and the body center, the present embodiment uses such relative geometric information for geometric encoding. For a candidate joint pair (i, j), the geometric relationship encoding can be calculated as

$b_e(i, j) = W_{be}\,[\,b_i,\ b_j\,]$

where $W_{be}$ is the transformation parameter that maps the original geometric information to a $d_b$-dimensional feature vector, and $b_i$ and $b_j$ are the joint position information of candidate joints i and j.
To extract relation codes from the category information and the visual representation, the present embodiment generates the class relation code and the visual relation code for a candidate joint pair (i, j) as

$c_e(i, j) = W_{ce}^{T}\,\sigma\big((V c_i) \odot (U c_j)\big)$

$v_e(i, j) = W_{ve}^{T}\,\sigma\big((V v_i) \odot (U v_j)\big)$

where V and U are linear matrices that project the input to feature vectors, T denotes matrix transposition, $\sigma$ is the nonlinear ReLU activation function, and $\odot$ denotes element-by-element multiplication between matrices. $W_{ce}$ and $W_{ve}$ map the relational features to the $d_c$- and $d_v$-dimensional feature vectors respectively. Furthermore, $c_i$ and $c_j$ are the category information and $v_i$ and $v_j$ the visual information of candidate joints i and j.
Next, the relation codes of all candidate joint pairs (the geometric relation code $b_e$, the class relation code $c_e$ and the visual relation code $v_e$) are concatenated to obtain the relationship features

$E_r \in \mathbb{R}^{N_p \times N_p \times (d_b + d_c + d_v)}$

Then, a sigmoid-activated linear function $\psi_a$ yields the joint relation graph between candidate joint pairs

$A = \psi_a(E_r, W_r) \in \mathbb{R}^{N_p \times N_p}$

where $W_r \in \mathbb{R}^{(d_b + d_c + d_v) \times 1}$ is a linear transformation parameter and $\psi_a$ is the joint-pair relation estimator with parameters $W_r$. Each element $A_{i,j}$ of the joint relation graph indicates the likelihood that candidate joint i and candidate joint j belong to the same pedestrian instance.
(2) Interference rejection
Given the candidate joint set P, the goal is to eliminate the interfering joints. Assume there are C multi-joint heatmaps for the C joint classes. For the k-th joint class there are $N_k$ candidate joints, originally scored by the heatmap $H_k$; using the joint relation graph A, they are re-scored as follows:

$S_v(i) = \frac{1}{N_k - 1} \sum_{j=1}^{N_k} \mathbb{I}[i \neq j]\, A(i, j)$

$S(i) = S_v(i) + H_k(i)$

where $S_v(i)$ is the average relation score of the i-th candidate joint, computed by traversing the other $N_k - 1$ candidates of class k; $\mathbb{I}[i \neq j]$ is an indicator function whose output is 1 if and only if candidate joints i and j are different; and $A(i, j)$ is the element of the joint relation graph A for candidate joints i and j. Furthermore, $H_k(i)$ is the potential obtained from the k-th multi-joint heatmap, and $S(i)$ is the final score of the i-th candidate joint. During forward estimation, the joint with the highest score is set as the target joint of the k-th joint class. For simplicity, the set of target joints is denoted $P_t$. Finally, the target joint attention map $A_t$ is generated from the target joints by applying a Gaussian-based heatmap generation method.
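The re-scoring rule transcribes almost directly into code; the sketch below assumes the heatmap potentials and the class-k slice of the relation graph have already been gathered per candidate.

```python
import numpy as np

def select_target_joint(scores_hk, relation_graph):
    """Re-score the N_k candidates of one joint class and keep the best.

    scores_hk:      (N_k,) heatmap potentials H_k(i) of the candidates
    relation_graph: (N_k, N_k) slice of A restricted to these candidates
    Returns the index of the candidate chosen as the target joint."""
    n_k = len(scores_hk)
    if n_k == 1:
        return 0
    # S_v(i): average relation score against the other N_k - 1 candidates
    off_diagonal = relation_graph.sum(axis=1) - np.diag(relation_graph)
    s_v = off_diagonal / (n_k - 1)
    s = s_v + scores_hk          # S(i) = S_v(i) + H_k(i)
    return int(np.argmax(s))     # highest score becomes the class-k target joint
```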
Third, target joint information generation and utilization
A pose estimator may fail when faced with blurred joints caused by crowded scenes, such as joints that are invisible due to severe occlusion, or joints with highly similar visual appearance. To address this problem, the present embodiment introduces a joint refiner to establish a pose correction mechanism. Specifically, the joint refiner corrects the visual features and the multi-joint estimator separately, in two steps: (1) pose feature correction; (2) pose estimator correction. These two steps are described in detail below with reference to fig. 4.
The aim of pose feature correction is to make the pose estimation model focus more on the most relevant visual feature regions, which can be achieved with the target joint attention map generated by the multi-joint relation analyzer. Specifically, the present embodiment first concatenates the visual features $f_I$ of the pedestrian detection frame area image with the corresponding target joint attention map $A_t$. Then, a correction network with successive convolution layers extracts the most relevant visual signals from the visual features under the guidance of the attention map, obtaining the corrected visual features $f_r$.
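The correction network can be sketched as a concatenation followed by successive convolutions, as described above; the depth, kernel sizes and channel counts below are assumptions.

```python
import torch
import torch.nn as nn

class PoseFeatureCorrection(nn.Module):
    """Corrects f_I under the guidance of the target joint attention map A_t."""

    def __init__(self, feature_dim=48, attention_channels=14):
        super().__init__()
        # successive convolution layers over the concatenated pair [f_I, A_t]
        self.correct = nn.Sequential(
            nn.Conv2d(feature_dim + attention_channels, feature_dim, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feature_dim, feature_dim, 3, padding=1),
        )

    def forward(self, f_i, a_t):
        # f_i: (B, D, H, W) visual features; a_t: (B, C, H, W) attention map
        return self.correct(torch.cat([f_i, a_t], dim=1))  # corrected features f_r
```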
After feature correction, the present embodiment requires a robust pose estimator for target joint estimation. In general, the position of a target joint can be inferred not only from visual signals but also from common sense. In light of this, the present embodiment constructs a target joint estimator by migrating joint analytic knowledge from the multi-joint estimator through common sense knowledge modeling. Specifically, the pose estimator correction comprises two main modules: 1) knowledge graph generation; 2) knowledge migration.
1) Knowledge graph generation
As previously mentioned, relevant prior knowledge helps to better estimate the positions of body joints. For example, one can easily infer the location of the "neck" after knowing where the "head" is, because there is a strong relationship between these two joint classes. To model such common sense relationships between pairs of joint classes, the present embodiment establishes a knowledge graph that provides information on the relationships between joint classes. The knowledge graph generation mechanism designed in this application comprises a semantic relationship $K_s \in \mathbb{R}^{C \times C}$ and a common sense relationship $K_c \in \mathbb{R}^{C \times C}$, where C is the number of joint classes (for example, CrowdPose defines 14 joint classes). The semantic relationship is a "soft connection" between joint classes (values between 0 and 1) obtained naturally from a language model, while the common sense relationship is a true "hard connection" between joint classes (values are either 0 or 1).
To construct the semantic relationship, the present embodiment first extracts the semantic embedding vectors of the joints from a language model and then calculates their similarity scores. Given the i-th and the j-th joint classes, their semantic similarity score is calculated as

$K_s(i, j) = \frac{W2V(c_i)^{T}\, W2V(c_j)}{\lVert W2V(c_i) \rVert \cdot \lVert W2V(c_j) \rVert}$

where W2V(·) is the Word2Vec function that extracts the semantic embedding vector of a joint from the language model, $\lVert \cdot \rVert$ denotes the Euclidean norm of a feature vector, and T denotes matrix transposition. It can be observed from the obtained semantic relationship that if two joint classes are connected, their similarity score is relatively high.
To construct the common sense relationship representing the natural connections between joints, this embodiment introduces an identification function I(s, k) whose output is 1 if and only if there is a "connection" relationship between the two joint classes (for example, it is set to 1 for the neck and the head, which are connected):

$K_c(s, k) = I(s, k) = \begin{cases} 1, & \text{if joint classes } s \text{ and } k \text{ are connected} \\ 0, & \text{otherwise} \end{cases}$
therefore, the common sense knowledge map K finally about human body architectureg∈RC×CThe calculation is as follows:
Figure RE-GDA0002646554480000091
wherein,
Figure RE-GDA0002646554480000092
table element by element multiplication. D-dimensional parameters W for a given multi-joint estimatorm∈RC×DAnd the common sense knowledge map K constructed as aboveg∈RC×CIn this embodiment, the matrix multiplier may be used to obtain the related resolution parameter Wk=Kg TWm,→RC×D
2) Knowledge migration

Having obtained the related analytic parameters $W_k$, the present embodiment then converts them into the parameters of the target joint estimator. To this end, a small network consisting of two linear transformations with a LeakyReLU activation between them is constructed to obtain the target joint estimator parameters $W_t \in \mathbb{R}^{C \times D}$, calculated as

$W_t = \Phi(W_k W_t^{1})\, W_t^{2}$

where $W_t^{1} \in \mathbb{R}^{D \times D}$ and $W_t^{2} \in \mathbb{R}^{D \times D}$ are two linear transformation matrices and $\Phi(\cdot)$ is the nonlinear LeakyReLU activation function.
Thus, with the parameters $W_t$, the target joint heatmap O can be estimated more reasonably from the corrected visual features $f_r$. This is because, unlike conventional methods that perform target pedestrian pose estimation with the parameters of the target joint alone, the present application also exploits the analytic parameters of the joints connected to the target joint, yielding a more accurate pose estimate. For example, if the "head" of a pedestrian is occluded while the "neck" is visible, the analytic parameters of the neck can be used to make a more reasonable inference about the head.
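The migration network is small enough to sketch directly. Applying $W_t$ to the corrected features as a 1x1 convolution is one plausible reading of the pixel-level target joint estimator and is stated here as an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KnowledgeMigration(nn.Module):
    """Converts the analytic parameters W_k into target joint estimator
    parameters W_t = Phi(W_k W_t1) W_t2, then estimates the heatmap O."""

    def __init__(self, dim):
        super().__init__()
        self.w_t1 = nn.Parameter(torch.randn(dim, dim) * 0.01)  # W_t^1
        self.w_t2 = nn.Parameter(torch.randn(dim, dim) * 0.01)  # W_t^2
        self.phi = nn.LeakyReLU(0.1)                            # Phi

    def forward(self, w_k, f_r):
        # w_k: (C, D) analytic parameters; f_r: (B, D, H, W) corrected features
        w_t = self.phi(w_k @ self.w_t1) @ self.w_t2             # (C, D)
        # use W_t as a 1x1 convolution kernel over f_r to get heatmap O
        kernel = w_t.unsqueeze(-1).unsqueeze(-1)                # (C, D, 1, 1)
        return torch.sigmoid(F.conv2d(f_r, kernel))             # (B, C, H, W)
```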
Fourth, model training
The pose estimation network model based on relational modeling has been explained in detail above. To enable the proposed model to better estimate target joints in crowded scenes, this embodiment designs corresponding learning targets to train the model parameters. Given a pedestrian detection frame, inputting its area image into the model yields three types of maps: 1) the target joint heatmap O; 2) the multi-joint heatmap H; 3) the joint relation graph A. Specifically: first, the area image of the pedestrian detection frame is input into the pose estimation model to extract the corresponding visual features. Then, all joints in the area image are identified from the visual features, i.e., a multi-joint heatmap H covering both the interfering joints and the target joints is generated, ensuring that all joints are in an activated state. Next, a candidate joint set is established from the multi-joint heatmap and all joints are evaluated, i.e., the joint relation graph A is generated. Finally, the target joint information of the target pedestrian instance in the area image is obtained from the joint relation graph, and the target joint estimation result is generated from the target joint information to produce the corresponding target joint heatmap O. The goal is to enhance the target joint response in the target joint heatmap O while ensuring that all joints in the multi-joint heatmap H are activated. To achieve this learning goal, supervised learning is performed on the multi-joint heatmap and the target joint heatmap using the mean square error (MSE), with loss functions defined as follows:
$l_t = \frac{1}{C} \sum_{k=1}^{C} \big\lVert O_k - \bar{O}_k \big\rVert_2^2$

$l_j = \frac{1}{C} \sum_{k=1}^{C} \big\lVert H_k - \bar{H}_k \big\rVert_2^2$

where $\bar{O}$ and $\bar{H}$ are the ground-truth target joint heatmap and multi-joint heatmap for the C joint classes, respectively. In detail, $\bar{O}$ contains only a unimodal Gaussian distribution for each target joint and can be obtained directly from the annotation data of the data set. $\bar{H}$, however, contains a multimodal Gaussian distribution covering both the target joints and the interfering joints, and the ground truth of the multi-joint heatmap is not given directly in the data set. To obtain it for a given pedestrian detection frame, the labeled target joints inside the frame are first collected; the other pedestrian detection frames are then traversed, and any of their target joints contained in the frame are labeled as interfering joints of the frame.
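The traversal just described can be sketched as follows; testing containment on the joint coordinates and deferring the Gaussian rendering to a separate step are assumptions.

```python
def collect_multi_joint_ground_truth(boxes, joints_per_box):
    """For each detection box, gather its own labeled target joints plus the
    target joints of other boxes that fall inside it (interfering joints).

    boxes:          list of (x0, y0, x1, y1)
    joints_per_box: list of [(class_k, x, y), ...] labeled target joints
    Returns, per box, the joint list to render as a multimodal Gaussian heatmap."""
    def inside(joint, box):
        _, x, y = joint
        x0, y0, x1, y1 = box
        return x0 <= x <= x1 and y0 <= y <= y1

    ground_truth = []
    for i, box in enumerate(boxes):
        joints = list(joints_per_box[i])            # target joints of this box
        for j, other_joints in enumerate(joints_per_box):
            if j == i:
                continue
            # other boxes' target joints inside this box become interference
            joints.extend(jt for jt in other_joints if inside(jt, box))
        ground_truth.append(joints)
    return ground_truth
```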
In addition, to obtain better target joint information and thus a better estimate of the target joints, the present embodiment also performs supervised learning on the joint relation graph, adopting the same strategy of computing the mean square error between the joint relation graph and its ground truth:

$l_r = \big\lVert A - \bar{A} \big\rVert_2^2$

where $\bar{A}$ is the ground truth of the joint relation graph, of size $N_p \times N_p$ ($N_p$ being the number of all candidate joints in the pedestrian detection frame, i.e., including the target joints and the interfering joints). Each element $\bar{A}_{i,j}$ is 0 or 1, taking the value 1 if and only if the i-th and the j-th joints belong to the same pedestrian instance. Thus, the learning objective of the entire model is calculated as

$L = \alpha l_t + \beta l_j + \theta l_r$

where α, β and θ are the weight hyperparameters of the corresponding learning targets, all set to 1 in the present embodiment.
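The combined objective transcribes directly from the formulas above; using the mean-reduction MSE is an assumption about the normalization.

```python
import torch
import torch.nn.functional as F

def total_loss(o, o_gt, h, h_gt, a, a_gt, alpha=1.0, beta=1.0, theta=1.0):
    """L = alpha*l_t + beta*l_j + theta*l_r, with all weights set to 1
    in the embodiment.

    o, o_gt: (B, C, H, W) predicted / ground-truth target joint heatmaps
    h, h_gt: (B, C, H, W) predicted / ground-truth multi-joint heatmaps
    a, a_gt: (B, N, N) predicted / ground-truth joint relation graphs"""
    l_t = F.mse_loss(o, o_gt)  # target joint heatmap supervision
    l_j = F.mse_loss(h, h_gt)  # multi-joint heatmap supervision
    l_r = F.mse_loss(a, a_gt)  # joint relation graph supervision
    return alpha * l_t + beta * l_j + theta * l_r
```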
Fifth, effects and experiments
Compared with the prior art, the contributions of this application are:
1) This embodiment introduces a novel strategy to handle the multiple joints in a pedestrian detection box, including the target joints and the interfering joints. It is the first attempt to model the relationships of all joints in one pedestrian detection box in order to eliminate the interfering joints. This new strategy greatly alleviates the confusion that arises in a pose estimation model when the same joint carries completely different pedestrian labels.
2) Inspired by the fact that humans can estimate the positions of blurred joints well by looking at the surrounding area, this embodiment introduces a pose correction method with common sense modeling capability, i.e., the pose estimation result is improved by correcting both the pose estimator and the visual pose features.
3) Extensive experimental results show that the pose estimation network model based on relational modeling outperforms the latest pose estimation methods, especially on the challenging CrowdPose data set.
The contents of the present invention have been explained above. Those skilled in the art will be able to implement the invention based on these teachings. Based on the above disclosure of the present invention, all other preferred embodiments and examples obtained by a person skilled in the art without any inventive step should fall within the scope of protection of the present invention.

Claims (10)

1. A pose estimation method, characterized by comprising:
extracting visual features from an area image defined by a pedestrian detection frame;
identifying all joints in the area image according to the visual features and establishing a candidate joint set;
evaluating all joints in the candidate joint set and obtaining target joint information of a target pedestrian instance in the area image; and
generating a target joint estimation result according to the target joint information so as to generate an estimated pose corresponding to the target pedestrian instance.
2. The pose estimation method according to claim 1, characterized in that the process of generating the target joint estimation result according to the target joint information is implemented by a target joint estimator modeled with common human knowledge.
3. The pose estimation method according to claim 1, characterized in that: the target joint information includes a corrected feature obtained by correcting the visual features through an attention mechanism which takes as its object of attention the target joints obtained by excluding the interfering joints from the candidate joint set after the evaluation.
4. The pose estimation method according to claim 1, characterized in that: the process of evaluating all joints in the candidate joint set and obtaining the target joint information of the target pedestrian instance in the area image comprises modeling the relationships of all joints in the candidate joint set and, on that basis, removing the interfering joints to obtain the target joint information of the target pedestrian instance.
5. An apparatus for pose estimation, configured as an artificial neural network, comprising:
a visual feature extraction module for extracting visual features from an area image defined by a pedestrian detection frame;
a candidate joint identification module for identifying all joints in the area image according to the visual features and establishing a candidate joint set;
a target joint information generation module for evaluating all joints in the candidate joint set and obtaining target joint information of a target pedestrian instance in the area image; and
an estimated pose generation module for generating a target joint estimation result according to the target joint information so as to generate an estimated pose corresponding to the target pedestrian instance.
6. The apparatus for pose estimation according to claim 5, wherein: the estimated pose generation module includes a target joint estimator modeled with human common sense.
7. The apparatus for pose estimation according to claim 5, wherein: the target joint information generation module includes a visual feature correction means that obtains a corrected feature by correcting the visual features through an attention mechanism which takes as its object of attention the target joints obtained by excluding the interfering joints from the candidate joint set after the evaluation.
8. The apparatus for pose estimation according to claim 5, wherein:
the candidate joint identification module comprises a multi-joint heatmap generation means for generating a heatmap of all joints from the extracted visual features;
the estimated pose generation module comprises a target joint heatmap generation means for generating a target joint heatmap according to the target joint estimation result;
and the training of the artificial neural network for pose estimation is performed by enhancing the target joint response in the target joint heatmap while ensuring that all joints in the multi-joint heatmap are in an activated state.
9. An electronic device for pose estimation, characterized by comprising:
a processor;
a memory for storing processor-executable instructions;
the processor is configured to perform the pose estimation method of any of claims 1-4.
10. A computer-readable storage medium, characterized by comprising a stored program which, when executed, performs the pose estimation method of any one of claims 1-4.