CN113378852A - Key point detection method and device, electronic equipment and storage medium

Key point detection method and device, electronic equipment and storage medium

Info

Publication number
CN113378852A
Authority
CN
China
Prior art keywords
key point
group
keypoint
offsets
key
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110568016.2A
Other languages
Chinese (zh)
Inventor
李帮怀
袁野
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Megvii Technology Co Ltd
Original Assignee
Beijing Megvii Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Megvii Technology Co Ltd filed Critical Beijing Megvii Technology Co Ltd
Priority to CN202110568016.2A priority Critical patent/CN113378852A/en
Publication of CN113378852A publication Critical patent/CN113378852A/en
Priority to PCT/CN2022/081229 priority patent/WO2022247403A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a key point detection method and device, electronic equipment and a storage medium. The method comprises: extracting image features of an image to be detected through a backbone network to obtain a feature map, wherein the image to be detected comprises a target object; performing key point detection on the feature map according to a plurality of posture key point templates to obtain at least one group of key points corresponding to the target object, wherein the posture key point templates represent the relative position relationship of a plurality of key points in the posture key point templates; and screening the at least one group of key points to obtain a key point detection result of the target object. Because the key point detection results of all target objects in the image to be detected can be determined directly from the plurality of posture key point templates, the detection efficiency of the key points is improved.

Description

Key point detection method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of image recognition technologies, and in particular, to a method and an apparatus for detecting a keypoint, an electronic device, and a storage medium.
Background
Key point detection is widely used in daily life. Common face recognition algorithms usually depend on the detection of key points, and applications such as fashion, beautification and face swapping are built on key point detection technology, which places high requirements on key point detection accuracy.
In the prior art, a common key point detection method is generally two-stage: in the first stage, a target detection model obtains the position of a target object in an image; in the second stage, the target object is cropped out according to the detected target frame, and key point detection is then performed by a key point detection model. This is also commonly called the "top-down" approach. Because the method must proceed step by step, when one image contains a plurality of target objects, the image must be cropped multiple times and the key point detection model must be run multiple times, so the detection efficiency is low.
Disclosure of Invention
In view of the above problems, embodiments of the present invention are proposed to provide a keypoint detection method, an apparatus, an electronic device, and a storage medium that overcome the above problems or at least partially solve the above problems.
According to a first aspect of the embodiments of the present invention, there is provided a method for detecting a key point, including:
extracting image features of an image to be detected through a backbone network to obtain a feature map, wherein the image to be detected comprises a target object;
performing key point detection on the feature map according to a plurality of posture key point templates to obtain at least one group of key points corresponding to the target object, wherein the posture key point templates represent the relative position relationship of a plurality of key points in the posture key point templates;
and screening the at least one group of key points to obtain a key point detection result of the target object.
According to a second aspect of the embodiments of the present invention, there is provided a method for detecting a keypoint, including:
extracting image features of an image to be detected through a backbone network to obtain a feature map, wherein the image to be detected comprises a target object;
performing key point detection on the feature map according to a plurality of posture key point templates to obtain at least one group of key point offsets corresponding to the target object, wherein the posture key point templates represent the relative position relationship of a plurality of key points in the posture key point templates, and each group of key point offsets represent the offsets of the group of key points in the feature map and the key points in each posture key point template;
and obtaining a key point detection result of the target object according to the at least one group of key point offsets.
According to a third aspect of the embodiments of the present invention, there is provided a key point detecting apparatus including:
the feature extraction module is used for extracting image features of an image to be detected through a backbone network to obtain a feature map, wherein the image to be detected comprises a target object;
the key point detection module is used for detecting key points of the feature map according to a plurality of posture key point templates to obtain at least one group of key points corresponding to the target object, wherein the posture key point templates represent the relative position relation of a plurality of key points in the posture key point templates;
and the detection result determining module is used for screening the at least one group of key points to obtain the key point detection result of the target object.
According to a fourth aspect of the embodiments of the present invention, there is provided a keypoint detection apparatus, including:
the feature extraction module is used for extracting image features of an image to be detected through a backbone network to obtain a feature map, wherein the image to be detected comprises a target object;
the key point detection module is used for detecting key points of the feature map according to a plurality of posture key point templates to obtain at least one group of key point offsets corresponding to the target object, wherein the posture key point templates represent the relative position relation of a plurality of key points in the posture key point templates, and each group of key point offsets represent the offsets of the group of key points in the feature map and the key points in each posture key point template;
and the detection result determining module is used for obtaining the key point detection result of the target object according to the at least one group of key point offsets.
According to a fifth aspect of embodiments of the present invention, there is provided an electronic apparatus, including: a processor, a memory and a computer program stored on the memory and executable on the processor, which computer program, when executed by the processor, implements the keypoint detection method as described in the first or second aspect.
According to a sixth aspect of embodiments of the present invention, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the keypoint detection method according to the first or second aspect.
According to the key point detection method and device, the electronic equipment and the storage medium provided by the embodiments of the invention, after the feature map of the image to be detected is extracted through the backbone network, key point detection is performed on the feature map according to the plurality of posture key point templates to obtain at least one group of key points corresponding to the target object, and the at least one group of key points is screened to obtain the key point detection result of the target object. The key point detection results of all target objects in the image to be detected can thus be determined directly from the plurality of posture key point templates, which improves the detection efficiency of the key points.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention.
Fig. 1 is a flowchart illustrating steps of a method for detecting a key point according to an embodiment of the present invention;
Figs. 2a-2c are exemplary diagrams of pose keypoint templates in an embodiment of the invention;
FIG. 3 is a flowchart illustrating steps of another method for detecting a keypoint in an embodiment of the present invention;
fig. 4 is a block diagram of a key point detecting apparatus according to an embodiment of the present invention;
fig. 5 is a block diagram of another key point detecting device according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
Fig. 1 is a flowchart illustrating steps of a method for detecting a keypoint, according to an embodiment of the present invention, as shown in fig. 1, the method may include:
Step 101, extracting image features of an image to be detected through a backbone network to obtain a feature map, wherein the image to be detected comprises a target object.
The backbone network is used for extracting image features of an image to be detected, and may be a ResNet-50 network or the like, for example. The target object may be, for example, a human face, a human body, a pet, a vehicle, or the like.
The image to be detected is input into a backbone network such as ResNet-50 to obtain a high-dimensional feature representation of the image to be detected, namely the feature map.
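As an illustrative sketch only (not part of the original disclosure), the feature extraction step could look as follows, assuming PyTorch/torchvision with ResNet-50 as the backbone and an arbitrary input resolution; the classifier layers are dropped so that a spatial feature map is retained.

import torch
import torchvision

# Keep only the convolutional stages of ResNet-50 so the output is a spatial feature map.
backbone = torch.nn.Sequential(
    *list(torchvision.models.resnet50(weights=None).children())[:-2]
)
backbone.eval()

image = torch.randn(1, 3, 512, 512)       # image to be detected (batch of one, assumed size)
with torch.no_grad():
    feature_map = backbone(image)          # high-dimensional feature representation
print(feature_map.shape)                    # e.g. torch.Size([1, 2048, 16, 16])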
Step 102, performing key point detection on the feature map according to a plurality of posture key point templates to obtain at least one group of key points corresponding to the target object, wherein the posture key point templates represent the relative position relationship of a plurality of key points in the posture key point templates.
The key point detection method in the embodiment of the invention is a bottom-up, single-stage, fully end-to-end key point detection method: any image to be detected can be input, and the key point positioning results of all target objects in the image are obtained directly, without using a target detection model to locate and crop out the target objects. The forward pass of the whole model is needed only once, whereas in the traditional top-down method the number of forward passes is proportional to the number of target objects in the image, because each target object must be cropped out.
The plurality of posture key point templates are pre-defined relative position relationships of the key points corresponding to different postures. Each posture key point template corresponds to the key points under one posture, and the number of key points is not necessarily the same in every template because some key points may be occluded in certain postures. Figs. 2a-2c are exemplary diagrams of posture key point templates in an embodiment of the invention. As shown in Figs. 2a-2c, when the target object is a human face, three different posture key point templates are provided, and each posture key point template defines the relative position relationship of the key points corresponding to one posture. The relative position relationship may be, for example, the position of each key point relative to a common center point; each posture key point template contains the same center point, and when the target object is a human face, the center point may be, for example, the key point corresponding to the nose center (i.e., the nose tip position), which is not limited in the embodiments of the present invention.
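Purely as an illustration of the data structure implied above (the concrete keypoint names and offset values below are invented for illustration, not taken from the embodiment), a posture key point template can be represented as a set of positions relative to the shared center point, e.g. the nose tip:

import numpy as np

# Each template: keypoint name -> (dx, dy) position relative to the center point (nose tip).
frontal_face_template = {
    "nose_tip":    (0.0,   0.0),    # shared center point
    "left_eye":    (-12.0, -15.0),
    "right_eye":   (12.0,  -15.0),
    "mouth_left":  (-10.0,  18.0),
    "mouth_right": (10.0,   18.0),
}

profile_face_template = {           # fewer keypoints: some are occluded in this posture
    "nose_tip":   (0.0,   0.0),
    "left_eye":   (-14.0, -13.0),
    "mouth_left": (-11.0,  17.0),
}

pose_templates = [np.array(list(t.values()))
                  for t in (frontal_face_template, profile_face_template)]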
After the feature map of the image to be detected is extracted, the feature map can be directly input into a key point detection model, and the key points in the image to be detected are subjected to regression calculation through the key point detection model based on a plurality of posture key point templates to obtain at least one group of key points corresponding to the target object.
Step 103, screening the at least one group of key points to obtain a key point detection result of the target object.
After at least one group of key points corresponding to the target object is obtained, the at least one group of key points is screened to select the key point groups that truly contain the target object, thereby obtaining the key point detection result of the target object in the image to be detected. The key point detection model in the embodiment of the invention does not need to first crop the target object out of the image to be detected; it can directly obtain the key point detection results of all target objects in the image to be detected based on the posture key point templates, and when the image to be detected contains a plurality of target objects, the key point detection results of the plurality of target objects can be obtained in one detection pass.
In an embodiment of the present invention, performing keypoint detection on the feature map according to a plurality of pose keypoint templates to obtain at least one group of keypoints corresponding to the target object includes: performing keypoint detection on the feature map according to the plurality of pose keypoint templates to obtain a plurality of candidate keypoint offsets and confidence degrees corresponding to the candidate keypoint offsets, wherein the candidate keypoint offsets are offsets between the feature points in the feature map and the keypoints in each pose keypoint template; and determining at least one group of keypoints corresponding to the target object according to the candidate keypoint offsets and the confidence degrees.
The feature point may be a pixel point in the feature map, or may also be a set of a plurality of pixel points in a specific region in the feature map.
The feature map is input into a key point detection model, which performs key point detection on the feature map based on the plurality of posture key point templates, regresses the offsets between the feature points in the feature map and the key points in each posture key point template to obtain a plurality of candidate key point offsets, and determines the confidence corresponding to each candidate key point offset. After the plurality of candidate key point offsets and confidences are obtained, the candidate key point offsets may first be preliminarily screened based on the confidences to select those whose confidence meets a preset condition, and the key point coordinates are then determined based on the key point templates, the feature points and the screened candidate key point offsets to obtain at least one group of key points corresponding to the target object. Alternatively, after the plurality of candidate key point offsets and confidences are obtained, a group of key point coordinates corresponding to each group of candidate key point offsets may be determined based on the key point templates, the feature points and the key point offsets, thereby obtaining at least one group of key points corresponding to the target object.
In an embodiment of the present invention, determining at least one group of keypoints corresponding to the target object according to the candidate keypoint offsets and the confidence degrees includes: screening out a combination of the keypoint offsets with the confidence degrees larger than or equal to a confidence degree threshold value from the candidate keypoint offsets, determining the screened combination of the keypoint offsets as at least one group of keypoint offsets, wherein each group of keypoint offsets represents the offsets of the keypoint group in the feature map and the keypoint in each posture keypoint template; and determining at least one group of key points corresponding to the target object according to the at least one group of key point offsets and the posture key point template corresponding to each group of key point offsets.
The confidence of some candidate keypoint offset combinations among the multiple candidate keypoint offsets may be relatively small, and such candidate keypoint offsets cannot yield correct keypoints. Therefore, combinations of keypoint offsets whose confidence is greater than or equal to a confidence threshold can be screened out from the multiple candidate keypoint offsets, each screened combination of keypoint offsets is determined to be one group of keypoint offsets, and multiple groups of keypoint offsets are obtained when multiple combinations are screened out, thereby obtaining at least one group of keypoint offsets. Based on the at least one group of keypoint offsets, the posture keypoint template corresponding to each group of keypoint offsets, and the feature point from which that group of keypoint offsets was obtained, the coordinates of at least one group of keypoints can be determined, giving the at least one group of keypoints corresponding to the target object. Preliminarily screening the candidate keypoints based on the confidence reduces the amount of calculation and increases the processing speed.
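A minimal sketch of this preliminary confidence screening is given below; the array names, shapes and the threshold value are assumptions for illustration only.

import numpy as np

def screen_by_confidence(candidate_offsets, confidences, conf_threshold=0.5):
    """candidate_offsets: (N, K, 2), one K-keypoint offset group per (feature point,
    template) pair; confidences: (N,). Keeps only sufficiently confident groups."""
    keep = confidences >= conf_threshold
    return candidate_offsets[keep], confidences[keep]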
In an embodiment of the present invention, the screening the at least one group of key points to obtain a key point detection result of the target object includes: and determining a key point detection result of the target object according to the at least one group of key points and the confidence corresponding to each group of key points.
After at least one group of key points corresponding to the target object is determined based on the plurality of candidate key point offsets and the confidence degrees, the confidence degree of each group of key points in the at least one group of key points is the confidence degree of the candidate key point offsets of the group of key points, so that the at least one group of key points can be screened based on the confidence degrees corresponding to the at least one group of key points, and the key point detection result of the target object is determined. When the at least one group of key points are screened, the screening can be performed based on non-maximum suppression processing, when the non-maximum suppression processing is performed, the non-maximum suppression processing can be directly performed on the at least one group of key points, or the non-maximum suppression processing can be performed on the target frames corresponding to the at least one group of key points after the target frames of each group of key points are determined.
In an optional embodiment, the determining, according to the confidence corresponding to the at least one group of keypoints and each group of keypoints, a keypoint detection result of the target object includes: respectively determining a target frame corresponding to each group of key points; and performing non-maximum suppression processing on the target frames corresponding to the at least one group of key points according to the confidence degrees corresponding to each group of key points to obtain the key point detection result of the target object.
Wherein the target box represents a location where a target object is located.
After obtaining at least one group of key points and the confidence degrees corresponding to each group of key points, the target frames corresponding to each group of key points can be firstly determined respectively, and then the target frames corresponding to at least one group of key points are subjected to non-maximum suppression processing based on the confidence degrees corresponding to each group of key points, so as to obtain a final key point detection result. After the target frame corresponding to the key point is determined, non-maximum suppression processing is performed based on the target frame, and compared with the method of directly performing non-maximum suppression processing on the key point, the method can reduce the processing data volume and further improve the detection efficiency.
In an optional implementation manner, the separately determining the target boxes corresponding to each group of key points includes: and respectively determining the minimum external rectangle corresponding to each group of key points as the target frame corresponding to the group of key points.
When determining a target frame corresponding to a group of key points, a minimum bounding rectangle of the group of key points may be determined first, and the minimum bounding rectangle is determined as the target frame corresponding to the group of key points.
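The following sketch illustrates one way to compute the minimum bounding rectangle of a key point group and to apply box-level non-maximum suppression; the IoU threshold and the array layout are assumptions, not values fixed by the embodiment.

import numpy as np

def min_bounding_rect(keypoints):
    """Smallest axis-aligned rectangle enclosing one (K, 2) group of keypoints."""
    x_min, y_min = keypoints.min(axis=0)
    x_max, y_max = keypoints.max(axis=0)
    return np.array([x_min, y_min, x_max, y_max])

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the most confident box, suppress boxes that overlap it too heavily."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter + 1e-9)
        order = order[1:][iou < iou_threshold]
    return keep

The groups of key points whose boxes survive nms(...) would then form the key point detection result of the target object.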
In this way, at least one group of key points corresponding to the target object and the confidence corresponding to each group of key points are obtained while the feature points are detected, and the at least one group of key points is screened based on the confidences corresponding to each group of key points to obtain the key point detection result, so that the key points of the target object can be screened accurately and the accuracy of key point detection is improved.
In an embodiment of the present invention, performing keypoint detection on the feature map according to a plurality of pose keypoint templates to obtain a plurality of candidate keypoint offsets and confidence degrees corresponding to the candidate keypoint offsets includes: matching the plurality of pose keypoint templates with the feature map respectively based on each feature point in the feature map, and determining the offsets between the feature point and the keypoints in each pose keypoint template together with the corresponding confidence, thereby obtaining at least one group of keypoint offsets and the confidence corresponding to each group of keypoint offsets.
In some embodiments of the present invention, each feature point in the feature map is taken in turn as the center point for matching the feature map with the pose keypoint templates, and the plurality of pose keypoint templates are matched with the feature map respectively. The offsets between the center point and the keypoints in each pose keypoint template are determined according to the relative position relationship between the keypoints and the center point in the plurality of pose keypoint templates. One feature point matched against the plurality of pose keypoint templates yields multiple groups of keypoint offsets, and multiple feature points yield further groups, so after every feature point has been matched with the pose keypoint templates, at least one group of keypoint offsets is obtained, and the confidence corresponding to each group of keypoint offsets is obtained in the matching process.
In an embodiment of the present invention, the screening the at least one group of key points according to the confidence corresponding to each group of key points to obtain the key point detection result of the target object includes: performing non-maximum suppression processing on the at least one group of key points according to the at least one group of key points and the confidence corresponding to each group of key points to obtain a key point detection result of the target object;
the method further comprises the following steps: and determining a target frame according to the key point detection result, wherein the target frame represents the position of the target object.
When at least one group of key points is screened according to the confidence of each group of key points, the at least one group of key points can be screened by carrying out non-maximum suppression processing on the at least one group of key points, namely, the at least one group of key points is directly subjected to the non-maximum suppression processing, so that the key point detection result of the target object is obtained. After the key point detection result of the target object is obtained, a group of key points belonging to the same target object in the key point detection result can be determined, the minimum circumscribed rectangle of the same group of key points is determined, and the minimum circumscribed rectangle is determined as the target frame of the same group of key points. The key point detection result and the corresponding target frame can be displayed simultaneously by determining the target frame.
According to the key point detection method provided by this embodiment, after the feature map of the image to be detected is extracted through the backbone network, key point detection is performed on the feature map according to the plurality of posture key point templates to obtain at least one group of key points corresponding to the target object, and the at least one group of key points is screened to obtain the key point detection result of the target object. The key point detection results of all target objects in the image to be detected can be determined directly based on the plurality of posture key point templates, so there is no need to first locate the target objects in the image to be detected and then perform key point detection, and the detection efficiency of the key points can be improved.
On the basis of the above technical solution, performing key point detection on the feature map according to the plurality of posture key point templates to obtain at least one group of key points corresponding to the target object comprises: performing key point detection on the feature map through a key point detection model according to the plurality of posture key point templates to obtain at least one group of key point offsets corresponding to the target object and the confidence corresponding to each group of key point offsets.
The keypoint detection model can be based on the offsets of keypoints in the regression feature map of the multiple pose keypoint templates and keypoints in the pose keypoint templates, and the confidence of the offsets of each group of keypoints is determined.
The feature map is input into the key point detection model. The key point detection model takes each feature point in the feature map as a center point to be matched with the posture key point templates, i.e., as an anchor point (anchor); each posture key point template is attached to the anchor point in turn, and the key point offsets of the key points relative to the anchor point are regressed for each attached posture key point template. Each posture key point template yields one group of key point offsets, so multiple posture key point templates yield multiple groups of key point offsets for each feature point, and the key point detection model also outputs a corresponding confidence for each group of key point offsets. For example, assuming the size of the feature map is 5 × 5, i.e., the feature map contains 25 feature points, and there are 24 posture key point templates, 25 × 24 groups of key point offsets are obtained after regression through the key point regression network. The regression of the key point offsets may be performed with convolution layers.
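As an illustrative sketch of this regression (the channel layout, keypoint count and feature dimensions are assumptions; the embodiment only states that the regression may be performed with convolution layers), a 1 × 1 convolutional head can output one offset group per posture key point template and one confidence per template at every feature point:

import torch
import torch.nn as nn

num_templates = 24    # posture key point templates
num_keypoints = 5     # keypoints per template (illustrative)
feat_channels = 2048

offset_head = nn.Conv2d(feat_channels, num_templates * num_keypoints * 2, kernel_size=1)
conf_head = nn.Conv2d(feat_channels, num_templates, kernel_size=1)

feature_map = torch.randn(1, feat_channels, 5, 5)    # 5 x 5 feature map = 25 anchors
offsets = offset_head(feature_map)                    # (1, 24*5*2, 5, 5)
confidences = conf_head(feature_map)                  # (1, 24, 5, 5): one score per offset group
offsets = offsets.view(1, num_templates, num_keypoints, 2, 5, 5)   # 25 x 24 offset groups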
By taking each feature point as a matching center and matching the plurality of posture key point templates with the feature map, more accurate key point positions are regressed, so that when one image contains a plurality of target objects, the image can be detected in a single pass and the detection efficiency is improved.
In some embodiments of the invention, the method further comprises: and training an initial key point detection model based on the posture key point template and the sample image to obtain the key point detection model.
The initial key point detection model can be obtained by randomly initializing network parameters, and the initial key point detection model is trained based on the posture key point template and the sample image to obtain the trained key point detection model.
In some embodiments of the invention, the step of training the keypoint detection model comprises:
acquiring a sample image and a key point label corresponding to a target object in the sample image; extracting image features of the sample image through a backbone network to obtain a sample feature map corresponding to the sample image; based on the attitude key point template, carrying out key point detection on the sample feature map through the initial key point detection model to obtain the predicted offset of a key point set output by the initial key point detection model; and training the initial key point detection model based on the predicted offset, the attitude key point template corresponding to the predicted offset and the key point labels to obtain the trained key point detection model.
The key point labels may be the position coordinates of key points detected by another key point detection method; for example, after a target object in a sample image is detected with a target detection model, the detected target object is cropped out and the key point coordinates are detected with a conventional key point detection model.
After a sample image is obtained, key point detection is performed on the sample image by means of another key point detection method to obtain the key point labels corresponding to the target objects in the sample image. The sample image is input into the backbone network, and feature extraction is performed on the sample image through the backbone network to obtain the sample feature map corresponding to the sample image. The sample feature map is input into the initial key point detection model, which performs key point detection on the sample feature map based on the posture key point templates to obtain the predicted offsets of a plurality of key point sets. The key point coordinates in each key point set are determined based on the predicted offsets and the posture key point templates corresponding to the predicted offsets, and the network parameters of the initial key point detection model are adjusted based on the key point coordinates and the key point labels to obtain the trained key point detection model.
On the basis of the above technical solution, the training step of the key point detection model further comprises determining, for each posture key point template, a confidence label corresponding to each sample feature point in the sample feature map based on the distance/offset between the key points in the posture key point template and the key point labels;
training an initial key point detection model based on the predicted offset, the attitude key point template corresponding to the predicted offset and the key point label to obtain a trained key point detection model, and the method comprises the following steps: and training the initial key point detection model based on the predicted offset, the posture key point template corresponding to the predicted offset, the key point label and the confidence label to obtain the trained key point detection model.
When there are a plurality of posture key point templates, it must ultimately be determined which posture key point template's result is to be output. For this purpose a classification network is provided in the key point detection model, and the confidences of the plurality of posture key point templates attached at a feature point are predicted through the classification network; therefore, the confidence labels corresponding to the sample feature map need to be determined in order to supervise the training of the key point detection model. Each feature point in the sample feature map is taken in turn as the center point for matching with the posture key point templates, and when each posture key point template is attached at that center point, the distance or offset of the key points in the posture key point template relative to the key point labels is determined, so that the confidence label of each feature point in the sample feature map relative to each posture key point template can be determined based on that distance or offset. One feature point corresponds to a plurality of posture key point templates, and the number of confidence labels equals the number of posture key point templates; for example, when there are 24 posture key point templates, one feature point in the sample feature map corresponds to 24 confidence labels.
After the confidence degree label of the sample feature map is obtained, the initial key point detection model can be trained based on the predicted offset, the posture key point template corresponding to the predicted offset, the key point label and the confidence degree label, and the training is finished when the training finishing condition is met, so that the trained key point detection model is obtained.
When determining the corresponding confidence label of each pose keypoint template at each sample feature point in the sample feature map based on the offset between the keypoint in the pose keypoint template and the keypoint label, the distance or the offset may be mapped between 0 and 1 to obtain the confidence label of each feature point in the sample feature map relative to each pose keypoint template. In some embodiments, the distance between a keypoint in the pose keypoint template and the keypoint label is the offset between the keypoint in the pose keypoint template and the keypoint label.
The distance or offset between the posture key point template and the key point labels at each feature point can be mapped into the interval from 0 to 1 by means of a sigmoid-like function, and the obtained value is used as the confidence label of that feature point relative to each posture key point template. Since the confidence should be lower when the distance is larger and higher when the distance is smaller, a larger distance maps to a smaller confidence label and a smaller distance maps to a larger confidence label.
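One possible mapping satisfying this description is sketched below. The exact function and scale are assumptions: an exponential decay is used here in place of the sigmoid mentioned in the text, since the embodiment only requires a mapping into the range from 0 to 1 that decreases as the distance grows.

import numpy as np

def confidence_label(template_keypoints, labeled_keypoints, scale=10.0):
    """Smaller mean distance -> label closer to 1; larger distance -> label closer to 0."""
    d = np.linalg.norm(template_keypoints - labeled_keypoints, axis=-1).mean()
    return float(np.exp(-d / scale))   # one monotonically decreasing mapping into (0, 1]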
In some embodiments of the present invention, the initial keypoint detection model further outputs a prediction confidence corresponding to the prediction offset.
On the basis of the technical scheme, training an initial key point detection model based on the predicted offset, the posture key point template corresponding to the predicted offset, the key point label and the confidence degree label to obtain a trained key point detection model, and the method comprises the following steps:
determining a regression loss value according to the predicted offset, the attitude key point template corresponding to the predicted offset and the key point label; determining a confidence loss value according to the prediction confidence and the confidence label;
and adjusting the network parameters of the initial key point detection model according to the regression loss value and the confidence coefficient loss value to obtain the trained key point detection model.
The predicted coordinates of the key points corresponding to the posture key point template are obtained according to the predicted offsets and the posture key point template corresponding to the predicted offsets, and the predicted key point coordinates and the key point labels corresponding to the sample feature map are substituted into a regression loss function to obtain a regression loss value; the prediction confidence and the confidence label corresponding to each group of predicted offsets are substituted into a confidence loss function to obtain a confidence loss value. The regression loss value and the confidence loss value are added to obtain a target loss value, and the network parameters of the initial key point detection model are adjusted based on the target loss value to obtain the trained key point detection model.
In some embodiments of the present invention, the predicted coordinates of the keypoints may be obtained from the predicted offsets output by the keypoint detection model and the pose keypoint template, and the absolute value of the difference between the predicted keypoint coordinates and the keypoint labels may be used as the regression loss function. The confidence loss function may be a cross-entropy loss function (Cross-Entropy Loss).
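A minimal sketch of this loss computation, assuming PyTorch tensors, is shown below; the tensor shapes, the reduction choices and the use of binary cross-entropy with soft confidence labels are assumptions for illustration.

import torch
import torch.nn.functional as F

def detection_loss(pred_offsets, templates, anchors, keypoint_labels, pred_conf, conf_labels):
    """pred_offsets/templates/keypoint_labels: (N, K, 2); anchors: (N, 2);
    pred_conf/conf_labels: (N,). Returns the target loss used for training."""
    # predicted keypoint coordinates = anchor point + template relative position + predicted offset
    pred_keypoints = anchors.unsqueeze(1) + templates + pred_offsets
    regression_loss = (pred_keypoints - keypoint_labels).abs().mean()               # L1 regression loss
    confidence_loss = F.binary_cross_entropy_with_logits(pred_conf, conf_labels)    # confidence loss
    return regression_loss + confidence_loss                                        # target loss value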
The training of the key point detection model is constrained through regression loss and confidence loss, so that the trained key point detection model can accurately give the key points and the corresponding confidence, and the overall detection accuracy can be improved.
Fig. 3 is a flowchart illustrating steps of a method for detecting a keypoint, according to an embodiment of the present invention, as shown in fig. 3, the method may include:
step 301, extracting image characteristics of an image to be detected through a backbone network to obtain a characteristic diagram, wherein the image to be detected comprises a target object.
The backbone network is used for extracting image features of an image to be detected, and may be a ResNet-50 network or the like, for example. The target object may be, for example, a human face, a human body, a pet, a vehicle, or the like.
The image to be detected is input into a backbone network such as ResNet-50 to obtain a high-dimensional feature representation of the image to be detected, namely the feature map.
Step 302, performing key point detection on the feature map according to a plurality of posture key point templates to obtain at least one group of key point offsets corresponding to the target object, wherein the posture key point templates represent the relative position relationship of a plurality of key points in the posture key point templates, and each group of key point offsets represent the offsets of the group of key points in the feature map and the key points in each posture key point template.
The plurality of posture key point templates are pre-defined relative position relationships of the key points corresponding to different postures; each posture key point template corresponds to the key points under one posture, and the number of key points is not necessarily the same in every template because some key points may be occluded in certain postures.
And regressing the key point offsets corresponding to the feature points in the feature map based on the plurality of posture key point templates, wherein one posture key point template can regress a group of key point offsets at one feature point, so that at least one group of key point offsets corresponding to the target object can be obtained for the feature points in the feature map and the plurality of posture key point templates. The coordinates of a set of keypoints may be obtained based on each set of keypoint offsets and the corresponding pose keypoint template.
In an embodiment of the present invention, performing keypoint detection on the feature map according to a plurality of pose keypoint templates to obtain at least one group of keypoint offsets corresponding to the target object, includes: performing key point detection on the feature map according to the plurality of posture key point templates to obtain a plurality of candidate key point offsets and confidence degrees corresponding to the candidate key point offsets; and determining at least one group of key point offsets corresponding to the target object according to the plurality of candidate key point offsets and the confidence degrees corresponding to the candidate key point offsets.
The key point offsets corresponding to the feature points in the feature map are regressed based on the plurality of posture key point templates, so multiple groups of key point offsets are obtained at each feature point, and the confidence corresponding to each group of key point offsets can be obtained at the same time; these key point offsets are the candidate key point offsets. Based on the confidence of each group of candidate key point offsets, combinations of candidate key point offsets whose confidence is greater than or equal to a confidence threshold can be screened out from the candidate key point offsets, and non-maximum suppression processing is performed on the screened combinations of candidate key point offsets to obtain at least one group of key point offsets corresponding to the target object. By determining the at least one group of key point offsets corresponding to the target object based on the confidences corresponding to the candidate key point offsets, more accurate key point offsets can be obtained.
Step 303, obtaining a key point detection result of the target object according to the at least one group of key point offsets.
Because each group of key point offsets represents the offsets of the key points in the group of key point in the feature map and each posture key point template, the key point detection result of the target object can be obtained based on at least one group of key point offsets and the corresponding posture key point template.
In an embodiment of the present invention, obtaining a keypoint detection result of the target object according to the at least one group of keypoint offsets includes:
and determining a key point detection result of the target object according to the at least one group of key point offsets and the posture key point template corresponding to the at least one group of key point offsets.
The coordinates of the key points in the posture key point template corresponding to each group of key point offsets, the key point offsets in that group, and the coordinates of the feature point from which that group of key point offsets was obtained are added together to obtain the key point coordinates of the target object, thereby obtaining the key point detection result of the target object.
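A minimal sketch of this decoding step follows (names and shapes are assumptions; in practice the feature point coordinates may also need to be scaled back to image resolution by the backbone stride, which the embodiment does not spell out):

import numpy as np

def decode_keypoints(feature_point, template_keypoints, keypoint_offsets):
    """feature_point: (2,); template_keypoints, keypoint_offsets: (K, 2).
    Returns the (K, 2) key point coordinates of the target object."""
    return feature_point[None, :] + template_keypoints + keypoint_offsets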
According to the key point detection method provided by this embodiment, after the feature map of the image to be detected is extracted through the backbone network, key point detection is performed on the feature map according to the plurality of posture key point templates to obtain at least one group of key point offsets corresponding to the target object, and the key point detection result of the target object is determined according to the at least one group of key point offsets. Since the key point offsets of all target objects in the image to be detected can be determined directly based on the plurality of posture key point templates, the key point detection results of all target objects can be obtained from those offsets without first locating the target objects in the image to be detected and then performing key point detection, so the detection efficiency of the key points can be improved.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Fig. 4 is a block diagram of a structure of a keypoint detection apparatus according to an embodiment of the present invention, and as shown in fig. 4, the keypoint detection apparatus may include:
the feature extraction module 401 is configured to extract image features of an image to be detected through a backbone network to obtain a feature map, where the image to be detected includes a target object;
a key point detection module 402, configured to perform key point detection on the feature map according to multiple pose key point templates to obtain at least one group of key points corresponding to the target object, where the pose key point templates represent relative position relationships of multiple key points in the pose key point templates;
the detection result determining module 403 is configured to screen the at least one group of key points to obtain a key point detection result of the target object.
Optionally, the key point detecting module includes:
the key point detection unit is used for detecting key points of the feature map according to the plurality of posture key point templates to obtain a plurality of candidate key point offsets and confidence degrees corresponding to the candidate key point offsets, wherein the candidate key point offsets are the offsets of the feature points in the feature map and the key points in each posture key point template;
and the key point determining unit is used for determining at least one group of key points corresponding to the target object according to the plurality of candidate key point offsets and the confidence degrees.
Optionally, the detection result determining module is specifically configured to:
and determining a key point detection result of the target object according to the at least one group of key points and the confidence corresponding to each group of key points.
Optionally, the detection result determining module includes:
the target frame determining unit is used for respectively determining the target frames corresponding to each group of key points;
and the detection result determining unit is used for carrying out non-maximum suppression processing on the target frames corresponding to the at least one group of key points according to the confidence degrees corresponding to each group of key points to obtain the key point detection result of the target object.
Optionally, the target frame determining unit is specifically configured to:
and respectively determining the minimum external rectangle corresponding to each group of key points as the target frame corresponding to each group of key points.
Optionally, the key point determining unit is specifically configured to:
screening out a combination of the keypoint offsets with the confidence degrees larger than or equal to a confidence degree threshold value from the candidate keypoint offsets, determining the screened combination of the keypoint offsets as at least one group of keypoint offsets, wherein each group of keypoint offsets represents the offsets of the keypoint group in the feature map and the keypoint in each posture keypoint template;
and determining at least one group of key points corresponding to the target object according to the at least one group of key point offsets and the posture key point template corresponding to each group of key point offsets.
Optionally, the key point detecting unit is specifically configured to:
and matching a plurality of the attitude key point templates with the feature map respectively based on each feature point in the feature map, determining the offset of the feature point and the key point in each attitude key point template and the confidence coefficient of the feature point, and obtaining at least one group of key point offsets and the confidence coefficient corresponding to each group of key point offsets.
Optionally, the detection result determining module includes:
the detection result determining unit is used for performing non-maximum suppression processing on the at least one group of key points according to the at least one group of key points and the confidence corresponding to each group of key points to obtain a key point detection result of the target object;
the device further comprises:
and the target frame determining module is used for determining a target frame according to the key point detection result, wherein the target frame represents the position of the target object.
Optionally, the key point detecting module is specifically configured to:
and performing key point detection on the feature map through a key point detection model according to the plurality of posture key point templates to obtain at least one group of key point offsets corresponding to the target object and confidence degrees corresponding to each group of key point offsets.
Optionally, the apparatus further comprises:
and the training module is used for training the initial key point detection model based on the posture key point template and the sample image to obtain the key point detection model.
Optionally, the training module includes:
the system comprises a sample acquisition unit, a target object detection unit and a comparison unit, wherein the sample acquisition unit is used for acquiring a sample image and a key point label corresponding to the target object in the sample image;
the sample feature extraction unit is used for extracting the image features of the sample image through a backbone network to obtain a sample feature map corresponding to the sample image;
the model processing unit is used for carrying out key point detection on the sample feature map through the initial key point detection model based on the attitude key point template to obtain the predicted offset of a key point set output by the initial key point detection model;
and the model training unit is used for training the initial key point detection model based on the predicted offset, the attitude key point template corresponding to the predicted offset and the key point label to obtain the trained key point detection model.
Optionally, the training module further includes:
a confidence label determining unit, configured to determine, based on a distance/offset between a keypoint in the pose keypoint template and the keypoint label, a confidence label corresponding to each pose keypoint template at each sample feature point in a sample feature map;
the model training unit is specifically configured to: and training the initial key point detection model based on the predicted offset, the posture key point template corresponding to the predicted offset, the key point label and the confidence degree label to obtain the trained key point detection model.
Optionally, the initial keypoint detection model further outputs a prediction confidence corresponding to the prediction offset.
Optionally, the model training unit is specifically configured to:
determining a regression loss value according to the predicted offset, the attitude key point template corresponding to the predicted offset and the key point label; determining a confidence loss value according to the prediction confidence and the confidence label;
and adjusting the network parameters of the initial key point detection model according to the regression loss value and the confidence coefficient loss value to obtain the trained key point detection model.
The key point detection device provided by this embodiment extracts the feature map of the image to be detected through the backbone network, performs key point detection on the feature map according to the plurality of posture key point templates to obtain at least one group of key points, and screens the at least one group of key points to obtain the key point detection result of the target object. The key point detection results of all target objects in the image to be detected can be determined directly based on the plurality of posture key point templates, so there is no need to first locate the target objects in the image to be detected and then perform key point detection, and the detection efficiency of the key points can be improved.
Fig. 5 is a block diagram of a structure of a keypoint detection apparatus according to an embodiment of the present invention, and as shown in fig. 5, the keypoint detection apparatus may include:
the feature extraction module 501 is configured to extract image features of an image to be detected through a backbone network to obtain a feature map, where the image to be detected includes a target object;
a keypoint detection module 502, configured to perform keypoint detection on the feature map according to multiple pose keypoint templates to obtain at least one group of keypoint offsets corresponding to the target object, where the pose keypoint templates represent relative position relationships of multiple keypoints in the pose keypoint templates, and each group of keypoint offsets represents offsets of the group of keypoints in the feature map and keypoints in each pose keypoint template;
a detection result determining module 503, configured to obtain a key point detection result of the target object according to the at least one group of key point offsets.
Optionally, the key point detecting module includes:
the key point detection unit is used for detecting key points of the feature map according to the plurality of posture key point templates to obtain a plurality of candidate key point offsets and confidence degrees corresponding to the candidate key point offsets;
and the offset determining unit is used for determining at least one group of key point offsets corresponding to the target object according to the plurality of candidate key point offsets and the confidence degrees corresponding to the candidate key point offsets.
Optionally, the detection result determining module is specifically configured to:
and determining a key point detection result of the target object according to the at least one group of key point offsets and the posture key point template corresponding to the at least one group of key point offsets.
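The detection result determining module combines each group of key point offsets with its posture key point template. A minimal decoding sketch is shown below; whether the template is anchored at the matched feature point and whether a feature-map stride is applied are assumptions of this example, as are the function name decode_keypoints and the default stride value.

    import numpy as np

    def decode_keypoints(feat_xy, template, offsets, stride=4):
        # feat_xy:  (2,)   position of the matched feature point in the feature map (x, y)
        # template: (K, 2) relative keypoint positions of the matched posture key point template
        # offsets:  (K, 2) one group of key point offsets predicted at this feature point
        # Returns:  (K, 2) absolute keypoint coordinates in the input image
        return (np.asarray(feat_xy) + np.asarray(template) + np.asarray(offsets)) * stride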
The key point detection device provided by this embodiment extracts the feature map of the image to be detected through the backbone network, performs key point detection on the feature map according to the plurality of posture key point templates, obtains at least one group of key point offsets corresponding to the target object, and determines the key point detection result of the target object according to the at least one group of key point offsets.
Since the device embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, reference may be made to the corresponding description of the method embodiments.
Further, according to an embodiment of the present invention, there is provided an electronic device, which may be a computer, a mobile terminal, or the like, including: a processor, a memory and a computer program stored on the memory and executable on the processor, which computer program, when executed by the processor, implements the keypoint detection method of the aforementioned embodiments.
According to an embodiment of the present invention, there is also provided a computer readable storage medium including, but not limited to, a disk memory, a CD-ROM, an optical memory, etc., having a computer program stored thereon, which when executed by a processor, implements the keypoint detection method of the foregoing embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The key point detection method, the key point detection device, the electronic device and the storage medium provided by the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the embodiments is only intended to help understand the method and its core idea. Meanwhile, a person skilled in the art may, according to the idea of the present invention, make changes to the specific implementations and the application scope. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (21)

1. A method for detecting a keypoint, comprising:
extracting image features of an image to be detected through a backbone network to obtain a feature map, wherein the image to be detected comprises a target object;
performing key point detection on the feature map according to a plurality of posture key point templates to obtain at least one group of key points corresponding to the target object, wherein the posture key point templates represent the relative position relationship of a plurality of key points in the posture key point templates;
and screening the at least one group of key points to obtain a key point detection result of the target object.
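Claim 1 relies on posture key point templates that encode the relative position relationship of a plurality of key points. Purely as an illustration, such a template could be stored as an array of relative coordinates; the five-point layout and the numeric values below are invented for the example and do not appear in the disclosure.

    import numpy as np

    # Hypothetical 5-keypoint "standing" template: (x, y) positions relative to the
    # template centre, in feature-map units. A real template set would typically
    # cover several common poses and more keypoints (e.g. 17 for human bodies).
    standing_template = np.array([
        [0.0, -4.0],   # head
        [-2.0, -1.0],  # left hand
        [2.0, -1.0],   # right hand
        [-1.0, 4.0],   # left foot
        [1.0, 4.0],    # right foot
    ])

    # The "plurality of posture key point templates" of claim 1 would be a list of such arrays.
    pose_templates = [standing_template]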
2. The method according to claim 1, wherein performing keypoint detection on the feature map according to a plurality of pose keypoint templates to obtain at least one group of keypoints corresponding to the target object, comprises:
performing key point detection on the feature map according to a plurality of posture key point templates to obtain a plurality of candidate key point offsets and confidence degrees corresponding to the candidate key point offsets, wherein the candidate key point offsets are offsets between the feature points in the feature map and the key points in each posture key point template;
and determining at least one group of key points corresponding to the target object according to the candidate key point offsets and the confidence degrees.
3. The method of claim 2, wherein the screening the at least one group of key points to obtain the key point detection result of the target object comprises:
and determining a key point detection result of the target object according to the at least one group of key points and the confidence corresponding to each group of key points.
4. The method according to claim 3, wherein the determining a key point detection result of the target object according to the at least one group of key points and the confidence corresponding to each group of key points comprises:
respectively determining a target frame corresponding to each group of key points;
and performing non-maximum suppression processing on the target frames corresponding to the at least one group of key points according to the confidence degrees corresponding to each group of key points to obtain the key point detection result of the target object.
5. The method according to claim 4, wherein the respectively determining a target frame corresponding to each group of key points comprises:
and respectively determining the minimum enclosing rectangle corresponding to each group of key points as the target frame corresponding to each group of key points.
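One possible realisation of claims 4 and 5 is sketched below: each group of key points is wrapped in its minimum enclosing rectangle (the target frame), and the frames are pruned by standard non-maximum suppression ordered by confidence. The IoU threshold, the box format and the helper names are assumptions made for the example.

    import numpy as np

    def enclosing_box(keypoints):
        # keypoints: (K, 2) absolute coordinates of one group of key points
        x_min, y_min = keypoints.min(axis=0)
        x_max, y_max = keypoints.max(axis=0)
        return np.array([x_min, y_min, x_max, y_max])   # minimum enclosing rectangle

    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-9)

    def nms_keypoint_groups(groups, confidences, iou_thr=0.5):
        # groups: list of (K, 2) keypoint arrays; confidences: matching list of floats
        boxes = [enclosing_box(g) for g in groups]
        keep = []
        for i in np.argsort(confidences)[::-1]:          # highest confidence first
            if all(iou(boxes[i], boxes[j]) < iou_thr for j in keep):
                keep.append(i)
        return [groups[i] for i in keep]                 # surviving groups form the detection result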
6. The method according to any one of claims 2-5, wherein the determining at least one group of key points corresponding to the target object according to the candidate key point offsets and the confidence degrees comprises:
screening out a combination of the key point offsets with the confidence degrees larger than or equal to a confidence degree threshold value from the candidate key point offsets, and determining the screened combination of the key point offsets as at least one group of key point offsets, wherein each group of key point offsets represents the offsets between the group of key points in the feature map and the key points in each posture key point template;
and determining at least one group of key points corresponding to the target object according to the at least one group of key point offsets and the posture key point template corresponding to each group of key point offsets.
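The first step of claim 6, keeping only candidate key point offsets whose confidence reaches a threshold, can be sketched as follows; the parallel-list data layout, the threshold value and the function name filter_candidates are assumptions.

    def filter_candidates(candidate_offsets, confidences, template_ids, conf_thr=0.3):
        # candidate_offsets: list of (K, 2) offset groups, one per (feature point, template) pair
        # confidences:       list of floats, one per candidate group
        # template_ids:      list of ints, index of the posture key point template of each group
        # Each kept (offsets, template index) pair is later decoded with its template
        # into a group of key points, as in the second step of claim 6.
        return [(o, t) for o, c, t in zip(candidate_offsets, confidences, template_ids) if c >= conf_thr]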
7. The method according to claim 6, wherein the screening the at least one group of key points to obtain the key point detection result of the target object comprises:
performing non-maximum suppression processing on the at least one group of key points according to the at least one group of key points and the confidence corresponding to each group of key points to obtain a key point detection result of the target object;
the method further comprises the following steps:
and determining a target frame according to the key point detection result, wherein the target frame represents the position of the target object.
8. The method according to any one of claims 2-7, wherein performing keypoint detection on the feature map according to a plurality of pose keypoint templates to obtain a plurality of candidate keypoint offsets and a confidence corresponding to each candidate keypoint offset comprises:
and matching a plurality of the posture key point templates with the feature map respectively based on each feature point in the feature map, determining the offsets between the feature point and the key points in each posture key point template and the confidence of the feature point, and obtaining at least one group of key point offsets and the confidence corresponding to each group of key point offsets.
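Claim 8 matches every posture key point template against the feature map at every feature point. In a convolutional realisation this is naturally a dense prediction head; the sketch below assumes 2*K offset channels plus one confidence channel per template, which is only one possible layout and not the claimed structure, and the class name and channel counts are likewise assumptions.

    import torch
    import torch.nn as nn

    class TemplateMatchingHead(nn.Module):
        # For each of T posture key point templates, predict K (x, y) offsets and
        # one confidence at every position of the feature map.
        def __init__(self, in_channels=256, num_templates=3, num_keypoints=17):
            super().__init__()
            self.T, self.K = num_templates, num_keypoints
            self.offsets = nn.Conv2d(in_channels, num_templates * num_keypoints * 2, kernel_size=1)
            self.conf = nn.Conv2d(in_channels, num_templates, kernel_size=1)

        def forward(self, feat):
            n, _, h, w = feat.shape
            off = self.offsets(feat).view(n, self.T, self.K, 2, h, w)   # per-template keypoint offsets
            conf = self.conf(feat).sigmoid()                            # (n, T, h, w) confidences
            return off, conf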
9. The method according to any one of claims 1 to 8, wherein performing keypoint detection on the feature map according to a plurality of pose keypoint templates to obtain at least one group of keypoints corresponding to the target object comprises:
and performing key point detection on the feature map through a key point detection model according to the plurality of posture key point templates to obtain at least one group of key point offsets corresponding to the target object and confidence degrees corresponding to each group of key point offsets.
10. The method of claim 8, further comprising: and training an initial key point detection model based on the posture key point template and the sample image to obtain the key point detection model.
11. The method of claim 10, wherein the step of training the keypoint detection model comprises:
acquiring a sample image and a key point label corresponding to a target object in the sample image;
extracting image features of the sample image through a backbone network to obtain a sample feature map corresponding to the sample image;
based on the posture key point template, carrying out key point detection on the sample feature map through the initial key point detection model to obtain the predicted offset of a key point set output by the initial key point detection model;
and training the initial key point detection model based on the predicted offset, the posture key point template corresponding to the predicted offset and the key point labels to obtain the trained key point detection model.
12. The method of claim 11, wherein the step of training the keypoint detection model further comprises:
determining a corresponding confidence label of each pose keypoint template at each sample feature point in the sample feature map based on the distance (offset) between the keypoints in the pose keypoint template and the keypoint labels;
and the training an initial key point detection model based on the predicted offset, the posture key point template corresponding to the predicted offset and the key point label to obtain a trained key point detection model comprises:
and training the initial key point detection model based on the predicted offset, the posture key point template corresponding to the predicted offset, the key point label and the confidence degree label to obtain the trained key point detection model.
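Claim 12 derives a confidence label for each pose keypoint template at each sample feature point from the distance between the template keypoints and the keypoint labels. One possible soft assignment, turning the mean distance into a value in [0, 1] with an exponential decay, is sketched below; the averaging over keypoints, the decay constant sigma and the function name are assumptions of this example.

    import numpy as np

    def confidence_label(feat_xy, template, keypoint_label, sigma=8.0):
        # feat_xy:        (2,)   position of the sample feature point
        # template:       (K, 2) relative keypoint positions of one pose keypoint template
        # keypoint_label: (K, 2) ground-truth keypoint coordinates of one target object
        placed = np.asarray(feat_xy) + np.asarray(template)            # template placed at this feature point
        dist = np.linalg.norm(placed - np.asarray(keypoint_label), axis=1).mean()
        return float(np.exp(-dist / sigma))                            # close match -> label near 1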
13. The method of claim 11 or 12, wherein the initial keypoint detection model further outputs a prediction confidence corresponding to the predicted offset.
14. The method of claim 12, wherein training an initial keypoint detection model based on the predicted offset, the pose keypoint template corresponding to the predicted offset, the keypoint label, and the confidence label to obtain a trained keypoint detection model comprises:
determining a regression loss value according to the predicted offset, the posture key point template corresponding to the predicted offset and the key point label; determining a confidence loss value according to the prediction confidence and the confidence label;
and adjusting the network parameters of the initial key point detection model according to the regression loss value and the confidence loss value to obtain the trained key point detection model.
15. A method for detecting a keypoint, comprising:
extracting image features of an image to be detected through a backbone network to obtain a feature map, wherein the image to be detected comprises a target object;
performing key point detection on the feature map according to a plurality of posture key point templates to obtain at least one group of key point offsets corresponding to the target object, wherein the posture key point templates represent the relative position relationship of a plurality of key points in the posture key point templates, and each group of key point offsets represents the offsets between the group of key points in the feature map and the key points in each posture key point template;
and obtaining a key point detection result of the target object according to the at least one group of key point offsets.
16. The method of claim 15, wherein performing keypoint detection on the feature map according to a plurality of pose keypoint templates to obtain at least one group of keypoint offsets corresponding to the target object comprises:
performing key point detection on the feature map according to the plurality of posture key point templates to obtain a plurality of candidate key point offsets and confidence degrees corresponding to the candidate key point offsets;
and determining at least one group of key point offsets corresponding to the target object according to the plurality of candidate key point offsets and the confidence degrees corresponding to the candidate key point offsets.
17. The method according to claim 15 or 16, wherein obtaining the keypoint detection result of the target object according to the at least one set of keypoint offsets comprises:
and determining a key point detection result of the target object according to the at least one group of key point offsets and the posture key point template corresponding to the at least one group of key point offsets.
18. A keypoint detection device, comprising:
the feature extraction module is used for extracting image features of an image to be detected through a backbone network to obtain a feature map, wherein the image to be detected comprises a target object;
the key point detection module is used for detecting key points of the feature map according to a plurality of posture key point templates to obtain at least one group of key points corresponding to the target object, wherein the posture key point templates represent the relative position relation of a plurality of key points in the posture key point templates;
and the detection result determining module is used for screening the at least one group of key points to obtain the key point detection result of the target object.
19. A keypoint detection device, comprising:
the feature extraction module is used for extracting image features of an image to be detected through a backbone network to obtain a feature map, wherein the image to be detected comprises a target object;
the key point detection module is used for detecting key points of the feature map according to a plurality of posture key point templates to obtain at least one group of key point offsets corresponding to the target object, wherein the posture key point templates represent the relative position relation of a plurality of key points in the posture key point templates, and each group of key point offsets represents the offsets between the group of key points in the feature map and the key points in each posture key point template;
and the detection result determining module is used for obtaining the key point detection result of the target object according to the at least one group of key point offsets.
20. An electronic device, comprising: a processor, a memory and a computer program stored on the memory and executable on the processor, the computer program when executed by the processor implementing the keypoint detection method of any of claims 1 to 14 or claims 15 to 17.
21. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, implements a keypoint detection method according to any one of claims 1 to 14 or claims 15 to 17.
CN202110568016.2A 2021-05-24 2021-05-24 Key point detection method and device, electronic equipment and storage medium Pending CN113378852A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110568016.2A CN113378852A (en) 2021-05-24 2021-05-24 Key point detection method and device, electronic equipment and storage medium
PCT/CN2022/081229 WO2022247403A1 (en) 2021-05-24 2022-03-16 Keypoint detection method, electronic device, program, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110568016.2A CN113378852A (en) 2021-05-24 2021-05-24 Key point detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113378852A true CN113378852A (en) 2021-09-10

Family

ID=77571822

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110568016.2A Pending CN113378852A (en) 2021-05-24 2021-05-24 Key point detection method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN113378852A (en)
WO (1) WO2022247403A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022247403A1 (en) * 2021-05-24 2022-12-01 北京迈格威科技有限公司 Keypoint detection method, electronic device, program, and storage medium
CN116563371A (en) * 2023-03-28 2023-08-08 北京纳通医用机器人科技有限公司 Method, device, equipment and storage medium for determining key points

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6604537B2 (en) * 2015-08-14 2019-11-13 株式会社デンソーアイティーラボラトリ Keypoint detector, keypoint detection method, and keypoint detection program
CN109584276B (en) * 2018-12-04 2020-09-25 北京字节跳动网络技术有限公司 Key point detection method, device, equipment and readable medium
CN110738110A (en) * 2019-09-11 2020-01-31 北京迈格威科技有限公司 Human face key point detection method, device, system and storage medium based on anchor point
CN111160288A (en) * 2019-12-31 2020-05-15 北京奇艺世纪科技有限公司 Gesture key point detection method and device, computer equipment and storage medium
CN112784739B (en) * 2021-01-21 2024-05-24 北京百度网讯科技有限公司 Model training method, key point positioning method, device, equipment and medium
CN113378852A (en) * 2021-05-24 2021-09-10 北京迈格威科技有限公司 Key point detection method and device, electronic equipment and storage medium


Also Published As

Publication number Publication date
WO2022247403A1 (en) 2022-12-01

Similar Documents

Publication Publication Date Title
CN109961009B (en) Pedestrian detection method, system, device and storage medium based on deep learning
CN109145766B (en) Model training method and device, recognition method, electronic device and storage medium
US8005263B2 (en) Hand sign recognition using label assignment
CN112396613B (en) Image segmentation method, device, computer equipment and storage medium
CN107958230B (en) Facial expression recognition method and device
CN110851641B (en) Cross-modal retrieval method and device and readable storage medium
CN110909618B (en) Method and device for identifying identity of pet
CN109858476B (en) Tag expansion method and electronic equipment
CN106295591A (en) Gender identification method based on facial image and device
CN111368682B (en) Method and system for detecting and identifying station caption based on master RCNN
CN109858327B (en) Character segmentation method based on deep learning
CN111126280B (en) Gesture recognition fusion-based aphasia patient auxiliary rehabilitation training system and method
CN110222780A (en) Object detecting method, device, equipment and storage medium
CN113255557B (en) Deep learning-based video crowd emotion analysis method and system
CN106326853A (en) Human face tracking method and device
CN113378852A (en) Key point detection method and device, electronic equipment and storage medium
CN112560710B (en) Method for constructing finger vein recognition system and finger vein recognition system
CN111444850A (en) Picture detection method and related device
CN112541394A (en) Black eye and rhinitis identification method, system and computer medium
CN110516638B (en) Sign language recognition method based on track and random forest
CN112347957A (en) Pedestrian re-identification method and device, computer equipment and storage medium
WO2023284670A1 (en) Construction method and apparatus for graphic code extraction model, identification method and apparatus, and device and medium
KR102026280B1 (en) Method and system for scene text detection using deep learning
CN113627576B (en) Code scanning information detection method, device, equipment and storage medium
CN111738248B (en) Character recognition method, training method of character decoding model and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination