CN114863473B

CN114863473B - Human body key point detection method, device, equipment and storage medium

Info

Publication number: CN114863473B
Application number: CN202210323217.0A
Authority: CN
Inventors: 杨黔生
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2022-03-29
Filing date: 2022-03-29
Publication date: 2023-06-16
Anticipated expiration: 2042-03-29
Also published as: CN114863473A

Abstract

The disclosure provides a human body key point detection method, device, equipment and storage medium. It relates to the technical field of artificial intelligence, in particular to deep learning and computer vision, and can be applied to scenarios such as 3D vision, augmented reality and virtual reality. The specific implementation scheme is as follows: acquiring a video frame sequence to be detected; performing human body key point detection on the video frames to be detected in the video frame sequence to be detected, to obtain vectors between human body key points and human body key point heat maps corresponding to the video frames to be detected; and determining the position information of the human body key points in the video frames to be detected according to the vectors between the human body key points and the human body key point heat maps. With this technical scheme, the key points of a human body in a video frame can be located efficiently and accurately.

Description

Human body key point detection method, device, equipment and storage medium

Technical Field

The disclosure relates to the technical field of artificial intelligence, in particular to the technical field of deep learning and computer vision, and can be applied to scenes such as 3D vision, augmented reality, virtual reality and the like.

Background

With the development of artificial intelligence technology, industries such as short video, live broadcast and online education continue to emerge, and across their interaction scenarios there is growing demand for interaction functions based on human body key point information. How to locate the key points of a human body accurately and efficiently is therefore important.

Disclosure of Invention

The disclosure provides a human body key point detection method, device, equipment and storage medium.

According to an aspect of the present disclosure, there is provided a human body key point detection method, including:

acquiring a video frame sequence to be detected;

performing human body key point detection on the video frames to be detected in the video frame sequence to be detected, to obtain vectors between human body key points and human body key point heat maps corresponding to the video frames to be detected;

and determining the position information of the human body key points in the video frame to be detected according to the vectors among the human body key points and the human body key point heat map.

According to another aspect of the present disclosure, there is provided an electronic device including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the human keypoint detection method of any one embodiment of the present disclosure.

According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the human body keypoint detection method according to any of the embodiments of the present disclosure.

According to the technology disclosed by the disclosure, the detection accuracy of key points of a human body can be improved.

It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.

Drawings

The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

FIG. 1 is a flowchart of a human body key point detection method provided according to an embodiment of the present disclosure;

FIG. 2 is a flow chart of another human keypoint detection method provided in accordance with an embodiment of the present disclosure;

FIG. 3 is a schematic diagram of a keypoint detection model provided in accordance with an embodiment of the present disclosure;

FIG. 4 is a flowchart of yet another human keypoint detection method provided in accordance with an embodiment of the present disclosure;

FIG. 5 is a flow chart of yet another human keypoint detection method provided in accordance with an embodiment of the present disclosure;

FIG. 6 is a schematic illustration of key points in a human body diagram structure provided in accordance with an embodiment of the present disclosure;

FIG. 7 is a schematic structural diagram of a human body key point detection device according to an embodiment of the present disclosure;

FIG. 8 is a block diagram of an electronic device for implementing a human body key point detection method of an embodiment of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

FIG. 1 is a flowchart of a human body key point detection method provided according to an embodiment of the present disclosure, which is applicable to scenarios in which human body key points need to be detected. The method can be executed by a human body key point detection device, which can be implemented in software and/or hardware and can be integrated in an electronic device carrying the human body key point detection function. As shown in FIG. 1, the human body key point detection method of the present embodiment may include:

S101, acquiring a video frame sequence to be detected.

In this embodiment, the video frame sequence to be detected is a video frame sequence requiring human body key point detection. The video frame sequence is a sequence formed by each video frame according to the acquisition time.

Specifically, the video frame sequence to be detected can be obtained from videos of interaction scenes such as short videos, live broadcasting, online education and the like.

S102, performing human body key point detection on the video frames to be detected in the video frame sequence to be detected, to obtain vectors between human body key points and human body key point heat maps corresponding to the video frames to be detected.

In this embodiment, the human body key points are key points for representing the human body, and may include, but are not limited to, human body center points, head (top of head), nose (or face center), neck, right shoulder, right elbow, right wrist, left shoulder, left elbow, left wrist, chest, pelvis, left hip, right hip, left knee, left ankle, right knee, right ankle, and other key points.

Furthermore, the connection relationship between key points of the human body can be determined according to the preset connection relationship of the human body structure, for example, the head (top of head) is connected with the nose (or the center of face), the nose (or the center of face) is connected with the neck, the neck is connected with the right shoulder, the neck is connected with the chest, the chest is connected with the center point of the human body, the center point of the human body is connected with the pelvis, and the like.

The vector between human body key points represents the positional offset between two connected human body key points, and may be the difference between the position coordinates of the two connected key points, such as the difference between the position coordinates of the head and the nose, the difference between the position coordinates of the human body center point and the chest, the difference between the position coordinates of the right shoulder and the right elbow, and the like.
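As a minimal illustration of the coordinate-difference definition above, the following Python sketch computes vectors for two connected pairs. The coordinates, the dictionary layout and the direction convention (child position minus parent position) are assumptions made for illustration and are not specified by the disclosure.

```python
import numpy as np

# Hypothetical (x, y) pixel coordinates; the values are illustrative only.
keypoints = {
    "head": np.array([120.0, 40.0]),
    "nose": np.array([121.0, 62.0]),
    "right_shoulder": np.array([95.0, 110.0]),
    "right_elbow": np.array([88.0, 160.0]),
}

# Connected pairs written as (parent, child); the direction is an assumption.
edges = [("head", "nose"), ("right_shoulder", "right_elbow")]

def keypoint_vector(parent, child):
    """Offset from parent to child: child position minus parent position."""
    return keypoints[child] - keypoints[parent]

vectors = {(p, c): keypoint_vector(p, c) for p, c in edges}
```

With this convention, adding a vector to the parent position recovers the child position, which is the property the later decoding steps rely on.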

The human body key point heat map is a heat map of the two-dimensional positions of human body key points in an image, and contains two-dimensional position information of the human body key points; optionally, each human body key point corresponds to one human body key point heat map, for example the head corresponds to a head heat map, the human body center point corresponds to a human body center point heat map, and so on.
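One common way to read a two-dimensional position out of such a heat map is to take the highest-scoring pixel. The sketch below assumes this argmax reading and uses a synthetic Gaussian heat map; the function name `heatmap_peak` and the map size are hypothetical.

```python
import numpy as np

def heatmap_peak(heatmap):
    """Return the (x, y) position of the highest-scoring pixel and its score."""
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return (int(col), int(row)), float(heatmap[row, col])

# Synthetic head heat map: a Gaussian centred at x=30, y=12 (illustrative data).
ys, xs = np.mgrid[0:64, 0:48]
head_heatmap = np.exp(-((xs - 30) ** 2 + (ys - 12) ** 2) / (2 * 2.0 ** 2))

position, score = heatmap_peak(head_heatmap)
```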

Optionally, human body key point detection can be performed, based on a key point detection model, on the video frames to be detected in the video frame sequence to be detected, and the vectors between human body key points and the human body key point heat maps corresponding to the video frames to be detected are obtained through model processing. The key point detection model is obtained in advance by training on training samples based on a machine learning algorithm.

S103, determining the position information of the human body key points in the video frames to be detected according to the vectors between the human body key points and the human body key point heat maps.

Specifically, the vector between the human body key points and the human body key point heat map can be processed based on the decoding model to obtain the position information of the human body key points in the video frame to be detected. Wherein the decoding model is pre-trained based on a machine learning algorithm.

According to the technical scheme, human body key point detection is performed on the video frames to be detected in the acquired video frame sequence to be detected, so that the vectors between human body key points and the human body key point heat maps corresponding to the video frames to be detected are obtained, and the position information of the human body key points in the video frames to be detected is then determined according to the vectors between the human body key points and the human body key point heat maps. Because vectors between the human body key points corresponding to the video frames are introduced in the detection process, the detection accuracy of human body key points is improved compared with existing detection approaches such as the Gaussian heat map approach; furthermore, especially in multi-person scenes, the key points of each human body can be located efficiently and accurately.

FIG. 2 is a flowchart of another human body key point detection method provided according to an embodiment of the present disclosure. On the basis of the above embodiment, the present embodiment further optimizes "performing human body key point detection on the video frames to be detected in the video frame sequence to be detected, to obtain vectors between human body key points and human body key point heat maps corresponding to the video frames to be detected", and provides an optional implementation. As shown in FIG. 2, the human body key point detection method of the present embodiment may include:

S201, obtaining a video frame sequence to be detected.

S202, inputting the video frame sequence to be detected into a first feature extraction network in the key point detection model to obtain first features corresponding to the video frames to be detected in the video frame sequence to be detected.

In this embodiment, the key point detection model may include a first feature extraction network, a second feature extraction network, and a key point detection network. The first feature extraction network is used to extract high-level semantic features of the video frames to be detected, namely the first features, and may be, for example, a convolutional neural network (such as ResNet-50); the second feature extraction network is used to extract inter-frame information of the video frame sequence to be detected, and may be, for example, a bidirectional recurrent neural network (bidirectional RNN); the key point detection network is used to extract the vectors between human body key points and the human body key point heat maps corresponding to the video frames to be detected, and may be, for example, a mask region-based convolutional neural network (Mask R-CNN).

In addition, the key point detection model is obtained by training in advance based on training sample data. Specifically, the training sample data may be used to jointly train the initial first feature extraction network, the initial second feature extraction network, and the initial key point detection network, so as to obtain the key point detection model. Furthermore, before model training, data augmentation may be applied to the training sample data, including scaling at different scales, rotation by different angles, and perturbation of the color space, so as to improve the generalization capability of the model.
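The augmentation step can be sketched as follows. The parameter ranges, the function names and the choice to rotate and scale about the image center are assumptions made for illustration; the disclosure does not specify them. Only the key point labels are transformed geometrically here; warping the image pixels with the same matrix is omitted.

```python
import numpy as np

def sample_affine(rng, width, height, scale_range=(0.75, 1.25), max_angle_deg=30.0):
    """Sample a 2x3 matrix that scales and rotates about the image centre."""
    scale = rng.uniform(*scale_range)
    angle = np.deg2rad(rng.uniform(-max_angle_deg, max_angle_deg))
    a, b = scale * np.cos(angle), scale * np.sin(angle)
    cx, cy = width / 2.0, height / 2.0
    # p' = R (p - c) + c, written as a single affine matrix.
    return np.array([
        [a, -b, cx - a * cx + b * cy],
        [b,  a, cy - b * cx - a * cy],
    ])

def transform_keypoints(points_xy, matrix):
    """Apply a 2x3 affine matrix to an (N, 2) array of (x, y) labels."""
    return points_xy @ matrix[:, :2].T + matrix[:, 2]

def jitter_colors(image, rng, max_gain=0.2):
    """Multiply each channel by a random gain in [1 - max_gain, 1 + max_gain]."""
    gains = rng.uniform(1.0 - max_gain, 1.0 + max_gain, size=image.shape[-1])
    return np.clip(image * gains, 0.0, 255.0)

rng = np.random.default_rng(0)
image = rng.uniform(0, 255, size=(64, 48, 3))       # synthetic frame: height 64, width 48
keypoints = np.array([[24.0, 32.0], [10.0, 5.0]])    # first point is the image centre
matrix = sample_affine(rng, width=48, height=64)
aug_keypoints = transform_keypoints(keypoints, matrix)
aug_image = jitter_colors(image, rng)
```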

Specifically, the video frame sequence to be detected may be input to the first feature extraction network in the key point detection model, and the first feature corresponding to each video frame to be detected in the video frame sequence to be detected is obtained through the processing of the first feature extraction network.

S203, inputting the first features into a second feature extraction network in the key point detection model to obtain target features corresponding to the video frames to be detected.

In this embodiment, compared with the first feature, the target feature can better represent the relevant feature of the human body in the video frame sequence to be detected.

Specifically, the first features corresponding to each video frame to be detected are input to a second feature extraction network in the key point detection model, and target features corresponding to each video frame to be detected can be obtained through the processing of the second feature extraction network.

S204, inputting the target features into a key point detection network in the key point detection model to obtain vectors between human body key points and human body key point heat maps corresponding to the video frames to be detected.

Specifically, the target feature corresponding to each video frame to be detected can be input to the key point detection network in the key point detection model, and the vectors between human body key points and the human body key point heat maps corresponding to each video frame to be detected can be obtained through the processing of the key point detection network.

S205, according to the vectors among the human body key points and the human body key point heat map, determining the position information of the human body key points in the video frame to be detected.

According to the technical scheme, a video frame sequence to be detected is obtained, then the video frame sequence to be detected is input into a first feature extraction network in a key point detection model to obtain first features corresponding to the video frames to be detected in the video frame sequence to be detected, the first features are input into a second feature extraction network in the key point detection model to obtain target features corresponding to the video frames to be detected, then the target features are input into the key point detection network in the key point detection model to obtain vectors between human key points corresponding to the video frames to be detected and a human key point heat map, and finally the position information of the human key points in the video frames to be detected is determined according to the vectors between the human key points and the human key point heat map. According to the technical scheme, the features of the video frame to be detected are extracted by adopting the two-level feature extraction network, so that the comprehensiveness and the accuracy of feature extraction are ensured, the determination of the vector between the human body key points and the human body key point heat map is more accurate, and the detection accuracy of the human body key points is further improved.
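The data flow through the three networks can be sketched at the level of array shapes. In the Python sketch below, every network is replaced by a toy stand-in with random weights: a linear projection instead of a CNN, neighbour averaging instead of a bidirectional RNN, and linear heads instead of a detection network. All sizes and names are assumptions; the sketch shows only how the outputs of one stage feed the next, not the actual model.

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, W, C = 4, 16, 16, 3    # toy clip: 4 RGB frames of 16x16 pixels
K, E, D = 18, 17, 8          # key points, connections, feature width (all assumed)

W_first = rng.normal(size=(H * W * C, D)) * 0.01
W_vec = rng.normal(size=(D, E * 2)) * 0.1
W_heat = rng.normal(size=(D, K * H * W)) * 0.1

def first_feature_network(frames):
    """Per-frame features; stands in for a CNN backbone."""
    return np.tanh(frames.reshape(len(frames), -1) @ W_first)

def second_feature_network(first_features):
    """Inter-frame features; stands in for a bidirectional RNN by
    averaging each frame with its temporal neighbours."""
    padded = np.pad(first_features, ((1, 1), (0, 0)), mode="edge")
    return (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0

def keypoint_detection_network(target_features):
    """Two heads: one (dx, dy) vector per connection, one heat map per key point."""
    vectors = (target_features @ W_vec).reshape(-1, E, 2)
    logits = (target_features @ W_heat).reshape(-1, K, H, W)
    return vectors, 1.0 / (1.0 + np.exp(-logits))

frames = rng.random((T, H, W, C))
first_features = first_feature_network(frames)
target_features = second_feature_network(first_features)
vectors, heatmaps = keypoint_detection_network(target_features)
```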

When human body key point detection is performed on the video frames to be detected, a key point that appears in other video frames may be missing from a certain video frame, which makes the target feature of that video frame inaccurate. Therefore, in order to make the target features corresponding to the video frames to be detected more accurate, as an optional manner of the embodiment of the disclosure, as shown in FIG. 3, the key point detection model includes a first feature extraction network, a second feature extraction network and a key point detection network. Further, the second feature extraction network is preferably a bidirectional RNN, and may include a forward feature extraction network and a reverse feature extraction network. The forward feature extraction network is used to process the first features corresponding to the video frames to be detected according to the forward order (i.e. the acquisition order) of the video frames to be detected in the video frame sequence to be detected; correspondingly, the reverse feature extraction network is used to process the first features corresponding to the video frames to be detected according to the reverse order of the video frames to be detected in the video frame sequence to be detected.

Correspondingly, inputting the first features into the second feature extraction network in the key point detection model to obtain the target features corresponding to the video frames to be detected may be implemented as follows: the first features are input into the forward feature extraction network and the reverse feature extraction network respectively, to obtain forward features and reverse features corresponding to the video frames to be detected; and the forward features and the reverse features are fused to obtain the target features corresponding to the video frames to be detected.

Specifically, the first features corresponding to each video frame to be detected can be input into the forward feature extraction network and the reverse feature extraction network respectively, and the forward feature and the reverse feature corresponding to each video frame to be detected can be obtained through the processing of the forward feature extraction network and the reverse feature extraction network; then, for each video frame to be detected, the forward feature and the reverse feature corresponding to the video frame can be fused, for example by splicing (concatenating) them, to obtain the target feature corresponding to the video frame to be detected.
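The forward and reverse processing followed by splicing can be sketched with a plain tanh recurrent cell. The cell type, the weight shapes and the random weights below are assumptions for illustration; the disclosure states only that a bidirectional RNN is preferred and that the two features may be spliced.

```python
import numpy as np

def run_rnn(inputs, w_in, w_hidden):
    """Simple tanh RNN over the first axis; returns one hidden state per step."""
    hidden = np.zeros(w_hidden.shape[0])
    states = []
    for x in inputs:
        hidden = np.tanh(x @ w_in + hidden @ w_hidden)
        states.append(hidden)
    return np.stack(states)

def bidirectional_features(first_features, fwd_params, rev_params):
    """Concatenate forward-order and reverse-order features for each frame."""
    forward = run_rnn(first_features, *fwd_params)
    # Process frames last-to-first, then flip back so row t matches frame t.
    reverse = run_rnn(first_features[::-1], *rev_params)[::-1]
    return np.concatenate([forward, reverse], axis=1)

rng = np.random.default_rng(0)
T, D, HIDDEN = 5, 8, 6                      # frames, input width, hidden width (assumed)
first_features = rng.normal(size=(T, D))
fwd_params = (rng.normal(size=(D, HIDDEN)) * 0.3, rng.normal(size=(HIDDEN, HIDDEN)) * 0.3)
rev_params = (rng.normal(size=(D, HIDDEN)) * 0.3, rng.normal(size=(HIDDEN, HIDDEN)) * 0.3)
target_features = bidirectional_features(first_features, fwd_params, rev_params)
```

In this sketch the forward half of the first frame's target feature depends only on that frame, while its reverse half carries information from every later frame, which is how a frame with a missing key point can borrow evidence from its neighbours.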

It can be understood that, by processing the first features through the forward feature extraction network and the reverse feature extraction network to obtain the target features, the inter-frame relationship between the video frames in the video frame sequence to be detected is fully considered, so that the determined target features are more accurate, which provides a guarantee for the subsequent positioning of the human body key points.

FIG. 4 is a flowchart of yet another human keypoint detection method provided in accordance with an embodiment of the present disclosure; the present embodiment provides an alternative implementation manner for further optimizing "determining location information of human keypoints in a video frame to be detected according to a vector between human keypoints and a human keypoint heat map" based on the above embodiment. As shown in fig. 4, the method for detecting key points of a human body provided in this embodiment may include:

S401, acquiring a video frame sequence to be detected.

S402, performing human body key point detection on the video frames to be detected in the video frame sequence to be detected, to obtain vectors between human body key points and human body key point heat maps corresponding to the video frames to be detected.

S403, determining two-dimensional coordinate information of the human body center point in the video frame to be detected according to the human body center point heat map in the human body key point heat map.

In this embodiment, the human body key points may be divided into a human body center point and non-human body center points; optionally, the human body center point and the non-human body center points may be determined from the human body key points according to pre-labeled information. For example, a piece of identification information may be assigned to each human body key point, and the human body center point and the non-human body center points may then be determined according to the identification information of each human body key point. For instance, 18 human body key points may be detected and denoted by serial numbers 1-18; if the key point corresponding to the abdomen is preset to be 1, the human body key point identified as 1 (the abdomen) is taken as the human body center point, and the other human body key points are non-human body center points. Further, the human body center point heat map is the heat map corresponding to the human body center point, and a non-human body center point heat map is a heat map corresponding to a non-human body center point.

Optionally, the human body center point heat map can be determined from the human body key point heat maps according to the identification information of the human body key points, and the position information in the human body center point heat map can then be used as the two-dimensional coordinate information of the human body center point. It should be noted that, if the video frame to be detected contains a single human body, the human body center point heat map contains the heat map of only one human body center point; if the video frame to be detected contains a plurality of human bodies, the human body center point heat map contains the heat maps of a plurality of human body center points, that is, the heat map contains a plurality of sub heat maps, and the position information of each sub heat map is used as the two-dimensional coordinate information of the corresponding human body center point.
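When several human bodies are present, one plausible way to obtain one position per sub heat map is to look for local maxima in the human body center point heat map. The sketch below assumes a 3x3 local-maximum test and a score threshold of 0.5; both are illustrative choices, and `find_peaks` is a hypothetical helper.

```python
import numpy as np

def find_peaks(heatmap, threshold=0.5):
    """Return (x, y) of pixels above threshold that are the maximum of their 3x3 neighbourhood."""
    h, w = heatmap.shape
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    # Stack the nine shifted views and take the neighbourhood maximum per pixel.
    neighbourhood = np.stack([
        padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)
    ])
    is_peak = (heatmap >= neighbourhood.max(axis=0)) & (heatmap > threshold)
    rows, cols = np.nonzero(is_peak)
    return [(int(c), int(r)) for r, c in zip(rows, cols)]

def gaussian(xs, ys, cx, cy, sigma=2.0):
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

# Synthetic centre point heat map containing two people (illustrative data).
ys, xs = np.mgrid[0:48, 0:64]
center_heatmap = gaussian(xs, ys, 10, 20) + gaussian(xs, ys, 40, 25)

center_points = find_peaks(center_heatmap)
```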

S404, determining the two-dimensional coordinate information of the non-human body center point in the video frame to be detected according to the two-dimensional coordinate information of the human body center point, the vectors among the human body key points and the non-human body center point heat map in the human body key point heat map.

Optionally, the two-dimensional coordinate information of the non-human body center points in the video frame to be detected can be determined, based on a preset rule and in combination with the human body structure diagram, according to the two-dimensional coordinate information of the human body center point, the vectors between the human body key points and the non-human body center point heat maps in the human body key point heat maps.

For example, for each video frame to be detected, if the video frame to be detected contains only a single human body, the two-dimensional coordinate information of each non-human body center point in the video frame to be detected can be determined directly according to the heat map of each non-human body center point; or, for each non-human body center point, the predicted coordinate information of the non-human body center point can be calculated, in combination with the human body structure diagram and the preset rule, according to the two-dimensional coordinate information of the human body center point and the vectors between the human body key points, and the two-dimensional coordinate information of the non-human body center point is then determined according to the predicted coordinate information of the non-human body center point and the position information in the non-human body center point heat map corresponding to the non-human body center point. Specifically, the midpoint between the predicted coordinate information of the non-human body center point and the position information in the corresponding non-human body center point heat map may be calculated, and the position information of the midpoint is used as the two-dimensional coordinate information of the non-human body center point.
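The midpoint rule described above can be sketched as follows. The function name and the reading of "position information in the heat map" as the highest-scoring pixel are assumptions made for illustration.

```python
import numpy as np

def refine_keypoint(center_xy, vector_to_keypoint, keypoint_heatmap):
    """Midpoint between the vector-based prediction and the heat map peak."""
    predicted = np.asarray(center_xy, float) + np.asarray(vector_to_keypoint, float)
    row, col = np.unravel_index(np.argmax(keypoint_heatmap), keypoint_heatmap.shape)
    observed = np.array([col, row], float)
    return (predicted + observed) / 2.0

# Illustrative data: centre at (20, 30), predicted chest offset (0, -10),
# and a chest heat map whose peak is at x=22, y=18.
chest_heatmap = np.zeros((64, 64))
chest_heatmap[18, 22] = 1.0

chest_xy = refine_keypoint((20, 30), (0, -10), chest_heatmap)
```

Here the prediction is (20, 20) and the heat map peak is (22, 18), so the midpoint is (21, 19).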

For another example, for each video frame to be detected, if the video frame to be detected contains a plurality of human bodies, for each human body center point, the two-dimensional coordinate information of the non-human body center point corresponding to the human body center point is determined according to the two-dimensional coordinate information of the human body center point, the vectors among the related human body key points and the corresponding non-human body center point heat map.

According to the technical scheme, a video frame sequence to be detected is obtained; human body key point detection is then performed on the video frames to be detected in the video frame sequence to be detected, to obtain the vectors between human body key points and the human body key point heat maps corresponding to the video frames to be detected; the two-dimensional coordinate information of the human body center point in the video frame to be detected is determined according to the human body center point heat map in the human body key point heat maps; and the two-dimensional coordinate information of the non-human body center points in the video frame to be detected is determined according to the two-dimensional coordinate information of the human body center point, the vectors between the human body key points and the non-human body center point heat maps in the human body key point heat maps. By taking the human body center point as the starting point and relying on the vectors between the human body key points, the human body key points can be located efficiently and accurately.

Fig. 5 is a flowchart of still another human keypoint detection method provided in accordance with an embodiment of the present disclosure. The embodiment provides an alternative implementation scheme for further optimizing the determination of the two-dimensional coordinate information of the non-human body center point in the video frame to be detected according to the two-dimensional coordinate information of the human body center point, the vector between the human body key points and the non-human body center point heat map in the human body key point heat map based on the embodiment. As shown in fig. 5, the method for detecting key points of a human body provided in this embodiment may include:

S501, acquiring a video frame sequence to be detected.

S502, performing human body key point detection on the video frames to be detected in the video frame sequence to be detected, to obtain vectors between human body key points and human body key point heat maps corresponding to the video frames to be detected.

S503, determining two-dimensional coordinate information of the human body center point in the video frame to be detected according to the human body center point heat map in the human body key point heat map.

S504, dividing the non-human body center point into a first key point and a second key point according to the connection relation between the non-human body center point and the human body center point in the human body key points.

Specifically, a key point connected with a human body center point in non-human body center points is used as a first key point. And taking the key points except the first key point in the non-human body center points, namely the key points which are not directly connected with the human body center points, as second key points.
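The division into first and second key points follows directly from the connection relation. A minimal sketch, assuming the connections are given as an edge list and using a partial, illustrative skeleton with the abdomen as the human body center point:

```python
def split_keypoints(edges, center):
    """Split non-centre key points into those directly connected to the centre
    (first key points) and all remaining ones (second key points)."""
    nodes, neighbours = set(), set()
    for a, b in edges:
        nodes.update((a, b))
        if a == center:
            neighbours.add(b)
        elif b == center:
            neighbours.add(a)
    first = sorted(neighbours)
    second = sorted(nodes - neighbours - {center})
    return first, second

# Partial skeleton for illustration only; a full skeleton would list every connection.
edges = [
    ("abdomen", "chest"),
    ("abdomen", "pelvis"),
    ("chest", "right_shoulder"),
    ("right_shoulder", "right_elbow"),
    ("right_elbow", "right_wrist"),
]

first_keypoints, second_keypoints = split_keypoints(edges, center="abdomen")
```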

S505, determining the two-dimensional coordinate information of the first key point according to the two-dimensional coordinate information of the human body center point, the first key point heat map and the vector between the first key point and the human body center point.

In this embodiment, the first keypoint heat map is a heat map corresponding to the first keypoint.

Specifically, according to the two-dimensional coordinate information of the human body center point and the vector between the first key point and the human body center point, the predicted coordinate information of the first key point is determined, and according to the predicted coordinate information of the first key point and the first key point heat map, the two-dimensional coordinate information of the first key point is determined.

S506, determining the two-dimensional coordinate information of the second key point according to the two-dimensional coordinate information of the first key point, the second key point heat map, the vector between the first key point and the second key point and the vector between different second key points.

Specifically, the predicted coordinate information of the second key point connected with the first key point can be determined according to the two-dimensional coordinate information of the first key point and the vector between the first key point and the second key point, and then the two-dimensional coordinate information of the second key point connected with the first key point is determined according to the predicted coordinate information and the position information in the corresponding second key point heat map; further, according to the two-dimensional coordinate information of the second key point, the second key point heat maps of other second key points connected with the second key point and vectors among the second key points and other second key points connected with the second key point, determining the two-dimensional coordinate information of other second key points connected with the second key point; and so on, sequentially determining two-dimensional coordinate information of all remaining second key points.

According to the technical scheme, a video frame sequence to be detected is obtained; human body key point detection is then performed on the video frames to be detected in the video frame sequence to be detected, to obtain the vectors between human body key points and the human body key point heat maps corresponding to the video frames to be detected; the two-dimensional coordinate information of the human body center point in the video frame to be detected is determined according to the human body center point heat map in the human body key point heat maps; the non-human body center points are divided into first key points and second key points according to the connection relation between the non-human body center points and the human body center point among the human body key points; the two-dimensional coordinate information of the first key point is determined according to the two-dimensional coordinate information of the human body center point, the first key point heat map and the vector between the first key point and the human body center point; and finally the two-dimensional coordinate information of the second key point is determined according to the two-dimensional coordinate information of the first key point, the second key point heat map, the vector between the first key point and the second key point and the vectors between different second key points. By introducing the first key points and the second key points, the coordinate information of the non-human body center points can be determined more efficiently and accurately.

In a specific example, with reference to the human body structure tree diagram shown in FIG. 6, the human body center point is set as the abdomen 1 and the other key points 2-18 are non-human body center points, among which the first key points are the chest 2 and the pelvis 3 and the remaining key points are second key points. Taking the human body center point (abdomen 1), the first key point (chest 2) and the second key points (right shoulder 4, right elbow 5 and right wrist 6) as an example, the process of determining the two-dimensional coordinate information of the human body center point and the non-human body center points is described in detail below:

The position information in the human body center point heat map corresponding to the abdomen 1 is taken as the two-dimensional coordinate information of the abdomen 1 (the human body center point). The two-dimensional coordinate information of the chest 2 is then determined according to the two-dimensional coordinate information of the abdomen 1, the vector between the chest 2 and the abdomen 1, and the first key point heat map corresponding to the chest 2. Next, the two-dimensional coordinate information of the right shoulder 4 is determined according to the two-dimensional coordinate information of the chest 2 (the first key point), the vector between the chest 2 and the right shoulder 4, and the second key point heat map corresponding to the right shoulder 4; the two-dimensional coordinate information of the right elbow 5 is determined according to the two-dimensional coordinate information of the right shoulder 4, the vector between the right shoulder 4 and the right elbow 5, and the second key point heat map corresponding to the right elbow 5; and the two-dimensional coordinate information of the right wrist 6 is determined according to the two-dimensional coordinate information of the right elbow 5, the vector between the right elbow 5 and the right wrist 6, and the second key point heat map corresponding to the right wrist 6. The two-dimensional coordinate information of the other first key points and second key points is determined in turn in the same way.
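The chain abdomen, chest, right shoulder, right elbow, right wrist described above can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the heat maps are assumed to be small grids stored as rows of floats, each vector is assumed to be an (dx, dy) offset from parent to child, and the way the heat map is combined with the predicted position (taking the strongest response within a small window) is an assumption, since the text does not fix that rule. The names `peak`, `refine`, `decode_chain` and `one_hot` are hypothetical.

```python
def peak(heat_map):
    """(x, y) of the strongest response in a heat map given as rows of floats."""
    cells = ((v, x, y) for y, row in enumerate(heat_map) for x, v in enumerate(row))
    _, x, y = max(cells, key=lambda t: t[0])
    return x, y


def refine(heat_map, predicted, radius=1):
    """Strongest response within `radius` cells of the predicted (x, y).

    The window search is an assumed fusion rule, not taken from the text.
    """
    h, w = len(heat_map), len(heat_map[0])
    px = min(max(predicted[0], 0), w - 1)  # clamp the prediction into the grid
    py = min(max(predicted[1], 0), h - 1)
    cells = [
        (heat_map[y][x], x, y)
        for y in range(max(0, py - radius), min(h, py + radius + 1))
        for x in range(max(0, px - radius), min(w, px + radius + 1))
    ]
    _, x, y = max(cells, key=lambda t: t[0])
    return x, y


def decode_chain(heat_maps, vectors, chain):
    """Walk outwards from the center point, one key point at a time."""
    coords = {chain[0]: peak(heat_maps[chain[0]])}
    for parent, child in zip(chain, chain[1:]):
        dx, dy = vectors[(parent, child)]
        predicted = (round(coords[parent][0] + dx), round(coords[parent][1] + dy))
        coords[child] = refine(heat_maps[child], predicted)
    return coords


def one_hot(x, y, size=6):
    """Toy heat map with a single response at (x, y)."""
    grid = [[0.0] * size for _ in range(size)]
    grid[y][x] = 1.0
    return grid


chain = ["abdomen", "chest", "right_shoulder", "right_elbow", "right_wrist"]
heat_maps = {
    "abdomen": one_hot(2, 4),
    "chest": one_hot(2, 1),
    "right_shoulder": one_hot(1, 1),
    "right_elbow": one_hot(0, 3),
    "right_wrist": one_hot(0, 5),
}
vectors = {
    ("abdomen", "chest"): (0, -2),
    ("chest", "right_shoulder"): (-1, 0),
    ("right_shoulder", "right_elbow"): (0, 1),
    ("right_elbow", "right_wrist"): (0, 2),
}
coords = decode_chain(heat_maps, vectors, chain)
```

In this toy data the vector alone places the chest at (2, 2), and the chest heat map moves it to (2, 1); each later key point starts from the refined position of its parent.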

On the basis of the above embodiment, as an optional manner of the present disclosure, if there are at least two human body center points, the first key point heat map of the first key point associated with each human body center point includes at least two sub heat maps; further, for the same first key point, the first key point heat map associated with each human body center point is the same. For example, there are two human body center points, denoted center point 1 and center point 2; the first key point is the chest and the corresponding first key point heat map is the chest heat map, so the chest associated with center point 1 and the chest associated with center point 2 are both associated with the chest heat map. Further, the chest heat map includes two sub heat maps, namely, a heat map corresponding to the chest associated with center point 1 and a heat map corresponding to the chest associated with center point 2.

Correspondingly, determining the two-dimensional coordinate information of the first key point according to the two-dimensional coordinate information of the human body center point, the first key point heat map and the vector between the first key point and the human body center point may be: determining, according to the two-dimensional coordinate information of each human body center point and the vector between the first key point associated with that human body center point and the human body center point, the predicted coordinate information of the first key point associated with that human body center point; selecting a target sub heat map from the at least two sub heat maps of the first key point heat map associated with the human body center point according to the predicted coordinate information; and determining the two-dimensional coordinate information of the first key point associated with the human body center point according to the target sub heat map.

Specifically, for each human body center point, the two-dimensional coordinate information of the human body center point is added to the vector between the human body center point and its associated first key point, and the sum is taken as the predicted coordinate information of that first key point. Next, the distance between the predicted coordinate information and each sub heat map of the first key point heat map is calculated, and the sub heat map with the minimum distance is taken as the target sub heat map (namely, the heat map matching the first key point associated with that human body center point). The position information in the target sub heat map is then taken as the two-dimensional coordinate information of the first key point associated with the human body center point.
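The selection step above can be sketched as follows, assuming that each sub heat map is summarized by a single (x, y) position (for instance its strongest response) and that the distance is Euclidean; neither choice is stated in the text. The function names and the sample coordinates are hypothetical.

```python
import math


def predict(center_xy, vector):
    """Predicted coordinate: center point position plus the center-to-key-point vector."""
    return (center_xy[0] + vector[0], center_xy[1] + vector[1])


def select_sub_heat_map(predicted_xy, sub_heat_map_positions):
    """Index of the sub heat map whose position is closest to the prediction."""
    return min(
        range(len(sub_heat_map_positions)),
        key=lambda i: math.dist(predicted_xy, sub_heat_map_positions[i]),
    )


# Two people in the frame: two center points and two chest sub heat maps.
centers = {"center_1": (10, 20), "center_2": (40, 22)}
chest_vectors = {"center_1": (0, -8), "center_2": (1, -9)}
chest_sub_heat_maps = [(41, 14), (10, 13)]  # assumed position of each sub heat map

chest_xy = {}
for name, center_xy in centers.items():
    predicted = predict(center_xy, chest_vectors[name])
    target = select_sub_heat_map(predicted, chest_sub_heat_maps)
    # The position in the target sub heat map becomes the chest coordinate.
    chest_xy[name] = chest_sub_heat_maps[target]
```

Here center point 1 predicts a chest at (10, 12) and is matched to the sub heat map at (10, 13), while center point 2 predicts (41, 13) and is matched to the one at (41, 14).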

It can be understood that the coordinate information of each human body key point can be more efficiently and accurately positioned under the multi-person scene.

Further, in the multi-person scenario, the second keypoint heat map of the second keypoint associated with each human body center point also includes at least two sub heat maps. Further, for each human body center point, after the two-dimensional coordinate information of the first key point associated with the human body center point is determined, the two-dimensional coordinate information of the second key point associated with the human body center point can be determined according to the two-dimensional coordinate information of the first key point associated with the human body center point, the second key point heat map of the second key point associated with the human body center point, the vector between the first key point associated with the human body center point and the second key point, and the vector between different second key points associated with the human body center point.

Specifically, for the second key point (i.e., the first sub-key point) directly connected to the first key point, the predicted coordinate information of the second key point associated with the human body center point may be determined according to the two-dimensional coordinate information of the first key point associated with the human body center point and the vector between the second key point associated with the human body center point and the first key point; selecting a target sub-heat map from at least two sub-heat maps of a second key point heat map associated with the human body center point according to the predicted coordinate information; and determining two-dimensional coordinate information of a second key point associated with the human body center point according to the target sub-heat map.

Similarly, the two-dimensional coordinate information of the second sub-key point (namely, a second key point which is not directly connected with the first key point) associated with the human body center point can be determined according to the two-dimensional coordinate information of the first sub-key point associated with the human body center point, the vectors between different second key points, and the heat map of the second sub-key point associated with the human body center point.
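Chaining this selection from the first key point through the first and second sub-key points, once per human body center point, might look like the sketch below. It assumes each person has their own set of parent-to-child vectors and that every key point type offers one candidate position per sub heat map; `decode_person` and all sample values are hypothetical.

```python
import math


def decode_person(center_xy, chain, vectors, candidates):
    """Resolve one person's key points along a chain starting at the center point."""
    coords = {chain[0]: center_xy}
    for parent, child in zip(chain, chain[1:]):
        px, py = coords[parent]
        dx, dy = vectors[(parent, child)]
        predicted = (px + dx, py + dy)
        # Pick the candidate (sub heat map position) nearest to the prediction.
        coords[child] = min(candidates[child], key=lambda c: math.dist(predicted, c))
    return coords


chain = ["center", "chest", "right_shoulder", "right_elbow"]
candidates = {
    "chest": [(10, 13), (41, 14)],
    "right_shoulder": [(6, 13), (37, 14)],
    "right_elbow": [(5, 18), (36, 20)],
}
person_a = decode_person(
    (10, 20),
    chain,
    {
        ("center", "chest"): (0, -8),
        ("chest", "right_shoulder"): (-4, 0),
        ("right_shoulder", "right_elbow"): (-1, 6),
    },
    candidates,
)
person_b = decode_person(
    (40, 22),
    chain,
    {
        ("center", "chest"): (1, -9),
        ("chest", "right_shoulder"): (-4, 1),
        ("right_shoulder", "right_elbow"): (-1, 5),
    },
    candidates,
)
```

Because every step starts from the already resolved parent, the two people's shoulders and elbows stay attached to the correct center point even though they share the same key point heat maps.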

On the basis of the above embodiment, as an optional manner of the embodiment of the present disclosure, in a 3D scene, performing human body key point detection on the video frames to be detected in the video frame sequence to be detected may yield the depth information of the human body center point corresponding to the video frame to be detected, the vectors between the human body key points, and the human body key point heat map. The two-dimensional coordinate information of the human body center point in the video frame to be detected is then determined according to the human body center point heat map in the human body key point heat map, and the three-dimensional coordinate information of the human body center point is determined according to its two-dimensional coordinate information and depth information. The depth information of a non-human body center point is determined according to the depth information of the human body center point and the vectors between the human body key points, and the three-dimensional coordinate information of the non-human body center point is determined according to its two-dimensional coordinate information and depth information.

It should be noted that, at this time, the vector between the key points of the human body is a three-dimensional vector.

For example, for each video frame to be detected, whether it contains a single human body or a plurality of human bodies, the three-dimensional coordinate information of each human body center point may be determined according to the two-dimensional coordinate information and the depth information of that human body center point. Specifically, the two-dimensional coordinate information may be used as the x-axis and y-axis coordinates and the depth information as the z-axis coordinate, so as to obtain the three-dimensional coordinate information of the human body center point.

For each non-human body center point, the depth information of the non-human body center point can be determined according to the depth information of the human body center point and the vectors between the human body key points, and the three-dimensional coordinate information of the non-human body center point can then be determined according to its two-dimensional coordinate information and depth information. For example, if the non-human body center point is the chest, the z-axis component is extracted from the vector between the human body center point and the chest, and the sum of the depth information of the human body center point and this z-axis component is taken as the depth information of the chest; the three-dimensional coordinate information of the chest is then determined according to the two-dimensional coordinate information and the depth information of the chest, that is, the depth information of the chest is used as the z-axis coordinate in the three-dimensional coordinate information.

It can be understood that, by introducing the depth information and the vectors between the human body key points to determine the three-dimensional coordinate information of the human body key points, the human body key points can be positioned efficiently and accurately in a 3D scene.

It should be noted that, in the 3D scene, the vectors between the human body key points are three-dimensional vectors; therefore, the two-dimensional coordinate information of a non-human body center point is determined according to the two-dimensional coordinate information of the human body center point and the x-axis and y-axis components of the vector between the human body key points.
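The 3D case can be sketched as below, using the abdomen-to-chest example. The sketch assumes the three-dimensional vector is an (x, y, z) offset from the human body center point to the key point and, for brevity, omits any heat map refinement of the x and y coordinates; the function names and numbers are hypothetical.

```python
def center_point_3d(center_xy, center_depth):
    """2D position supplies x and y; the predicted depth supplies z."""
    x, y = center_xy
    return (x, y, center_depth)


def non_center_point_3d(center_xyz, vector_3d):
    """Offset the center point by a 3D vector: x/y components give the 2D
    position, and center depth plus the z component gives the depth."""
    cx, cy, cz = center_xyz
    vx, vy, vz = vector_3d
    return (cx + vx, cy + vy, cz + vz)


abdomen = center_point_3d((120, 200), 3.5)
chest = non_center_point_3d(abdomen, (0, -40, -0.25))
```

With these sample values the chest keeps the abdomen's x coordinate, moves 40 pixels up in the image, and sits 0.25 depth units closer to the camera.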

Fig. 7 is a schematic structural diagram of a human body key point detection device according to an embodiment of the present disclosure. The embodiment of the disclosure is suitable for the situation of training a student model based on knowledge distillation technology. The device can be implemented by software and/or hardware, and can implement the human body key point detection method of any embodiment of the disclosure. As shown in fig. 7, the human body key point detection apparatus 700 includes:

a video frame sequence acquisition module 701, configured to acquire a video frame sequence to be detected;

the key point detection module 702 is configured to perform human key point detection on a to-be-detected video frame in a to-be-detected video frame sequence, so as to obtain a vector between human key points corresponding to the to-be-detected video frame and a human key point heat map;

and the position information determining module 703 is configured to determine position information of the human body key points in the video frame to be detected according to the vectors between the human body key points and the human body key point heat map.
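How the three modules hand data to one another can be sketched as a small pipeline. This is only an illustration of the data flow: the class and field names are hypothetical, and the three callables below are stubs standing in for modules 701, 702 and 703 rather than real detectors.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple


@dataclass
class KeyPointDetectionDevice:
    acquire_frames: Callable[[], Sequence[Any]]                          # role of module 701
    detect_key_points: Callable[[Sequence[Any]], List[Tuple[Any, Any]]]  # role of module 702
    determine_positions: Callable[[Any, Any], Any]                       # role of module 703

    def run(self) -> List[Any]:
        frames = self.acquire_frames()
        # One (vectors, heat maps) pair per video frame to be detected.
        per_frame = self.detect_key_points(frames)
        return [self.determine_positions(v, h) for v, h in per_frame]


def stub_acquire():
    return ["frame_0", "frame_1"]


def stub_detect(frames):
    # Stub output: a single vector and a center position standing in for a heat map.
    return [({("abdomen", "chest"): (0, -2)}, {"abdomen": (2, 4)}) for _ in frames]


def stub_positions(vectors, heat_maps):
    ax, ay = heat_maps["abdomen"]
    dx, dy = vectors[("abdomen", "chest")]
    return {"abdomen": (ax, ay), "chest": (ax + dx, ay + dy)}


device = KeyPointDetectionDevice(stub_acquire, stub_detect, stub_positions)
positions = device.run()
```

Keeping the three roles as separate callables mirrors the statement that the device can be implemented in software and/or hardware: any one stage could be swapped without touching the others.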

According to this technical scheme, the video frame sequence to be detected is acquired, human body key point detection is performed on the video frames to be detected in the video frame sequence to be detected to obtain the vectors between human body key points corresponding to the video frames to be detected and the human body key point heat map, and the position information of the human body key points in the video frames to be detected is then determined according to the vectors between the human body key points and the human body key point heat map. Because the vectors between the human body key points corresponding to the video frames are introduced into the human body key point detection process, the detection accuracy is improved compared with existing human body key point detection approaches such as the Gaussian heat map approach; moreover, especially in a multi-person scene, the scheme can efficiently and accurately locate the key points of each human body.

Further, the keypoint detection module 702 includes:

the first feature determining unit is used for inputting the video frame sequence to be detected into a first feature extraction network in the key point detection model to obtain first features corresponding to the video frames to be detected in the video frame sequence to be detected;

the target feature determining unit is used for inputting the first feature into a second feature extraction network in the key point detection model to obtain a target feature corresponding to the video frame to be detected;

and the heat map determining unit is used for inputting the target feature into a key point detection network in the key point detection model to obtain the vectors between the human body key points corresponding to the video frame to be detected and the human body key point heat map.

Further, the second feature extraction network in the keypoint detection model includes a forward feature extraction network and a reverse feature extraction network;

correspondingly, the target feature determining unit is specifically configured to:

respectively inputting the first features into a forward feature extraction network and a reverse feature extraction network to obtain forward features and reverse features corresponding to the video frames to be detected;

and fusing the forward feature and the reverse feature to obtain a target feature corresponding to the video frame to be detected.
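The idea of running the per-frame first features through a forward pass and a reverse pass and then fusing the two can be illustrated with a toy example. Everything concrete here is an assumption: the networks are replaced by a simple running average over time, and fusion is done by concatenation, whereas the text only states that a forward feature and a reverse feature are obtained and fused. The names `running_state` and `target_features` are hypothetical.

```python
def running_state(features, decay=0.5):
    """Toy recurrent pass: each output mixes the current feature with the previous state."""
    state = [0.0] * len(features[0])
    outputs = []
    for feat in features:
        state = [decay * s + (1.0 - decay) * f for s, f in zip(state, feat)]
        outputs.append(state)
    return outputs


def target_features(first_features):
    """Fuse a first-to-last pass and a last-to-first pass, frame by frame."""
    forward = running_state(first_features)
    reverse = running_state(first_features[::-1])[::-1]
    # Concatenation is an assumed fusion; summation would be another option.
    return [f + r for f, r in zip(forward, reverse)]


# Three frames with one-dimensional first features; only the middle frame is active.
fused = target_features([[0.0], [1.0], [0.0]])
```

In the result, the first frame's forward part is 0.0 but its reverse part is 0.25, which shows how the reverse pass lets an earlier frame draw on information from a later one.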

Further, the position information determining module 703 includes:

the first coordinate determining unit is used for determining two-dimensional coordinate information of a human body center point in a video frame to be detected according to a human body center point heat map in the human body key point heat map;

the second coordinate determining unit is used for determining the two-dimensional coordinate information of the non-human body center point in the video frame to be detected according to the two-dimensional coordinate information of the human body center point, the vectors among the human body key points and the non-human body center point heat map in the human body key point heat map.

Further, the second coordinate determination unit includes:

the key point dividing sub-unit is used for dividing the non-human body center point into a first key point and a second key point according to the connection relation between the non-human body center point and the human body center point in the human body key points;

the first coordinate determining subunit is used for determining the two-dimensional coordinate information of the first key point according to the two-dimensional coordinate information of the human body center point, the first key point heat map and the vector between the first key point and the human body center point;

and the second coordinate determining subunit is used for determining the two-dimensional coordinate information of the second key point according to the two-dimensional coordinate information of the first key point, the second key point heat map, the vector between the first key point and the second key point and the vector between different second key points.
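The division performed by the key point dividing sub-unit can be pictured as a breadth-first walk over the skeleton tree, starting at the human body center point. The sketch below uses only the connections named in the fig. 6 example; the edge list, the joint names as strings, and the function `split_key_points` are hypothetical.

```python
from collections import deque

# Connections taken from the fig. 6 example; the remaining joints are omitted.
EDGES = [
    ("abdomen", "chest"),
    ("abdomen", "pelvis"),
    ("chest", "right_shoulder"),
    ("right_shoulder", "right_elbow"),
    ("right_elbow", "right_wrist"),
]


def split_key_points(edges, center):
    """Return (first_key_points, second_key_points, parent) for a skeleton tree."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)

    parent = {center: None}
    first, second = [], []
    queue = deque([center])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, []):
            if nxt in parent:
                continue  # already visited
            parent[nxt] = node
            # Directly connected to the center point: first key point; otherwise second.
            (first if node == center else second).append(nxt)
            queue.append(nxt)
    return first, second, parent


first, second, parent = split_key_points(EDGES, "abdomen")
```

The `parent` mapping records which already located key point each new key point should be offset from, which is the order the first and second coordinate determining subunits need.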

Further, if the number of the human body center points is at least two, the first key point heat map of the first key point associated with each human body center point comprises at least two sub heat maps;

correspondingly, the first coordinate determination subunit is specifically configured to:

according to the two-dimensional coordinate information of each human body center point and the vector between the first key point associated with the human body center point and the human body center point, determining the predicted coordinate information of the first key point associated with the human body center point;

selecting a target sub-heat map from at least two sub-heat maps of the first key point heat map associated with the human body center point according to the predicted coordinate information;

and determining the two-dimensional coordinate information of the first key point associated with the human body center point according to the target sub-heat map.

Further, the position information determining module 703 further includes:

the third coordinate determining unit is used for determining three-dimensional coordinate information of the human body center point according to the two-dimensional coordinate information and the depth information of the human body center point;

a depth information determining unit for determining depth information of a non-human body center point according to the depth information of the human body center point and a vector between human body key points;

and the fourth coordinate determining unit is used for determining the three-dimensional coordinate information of the non-human body center point according to the two-dimensional coordinate information and the depth information of the non-human body center point.

In the technical scheme of the disclosure, the acquisition, storage, application and the like of the video frame sequences involved all comply with relevant laws and regulations and do not violate public order and good morals.

According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.

Fig. 8 illustrates a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 8, the electronic device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the electronic device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.

Various components in electronic device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the electronic device 800 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.

The computing unit 801 may be any of a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the respective methods and processes described above, for example, the human body key point detection method. For example, in some embodiments, the human body key point detection method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the human body key point detection method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the human body key point detection method by any other suitable means (e.g., by means of firmware).

Various implementations of the systems and techniques described above can be realized in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the internet.

The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.

Artificial intelligence is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (e.g., learning, reasoning, thinking, planning), and it involves both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technologies, and the like.

Cloud computing refers to a technical system in which an elastically scalable pool of shared physical or virtual resources is accessed over a network; the resources may include servers, operating systems, networks, software, applications, storage devices and the like, and can be deployed and managed on demand in a self-service manner. Cloud computing technology can provide efficient and powerful data processing capability for technical applications such as artificial intelligence and blockchain, and for model training.

It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved; no limitation is imposed herein.

The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims

1. A human body key point detection method comprises the following steps:

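acquiring a video frame sequence to be detected;

performing human body key point detection on the video frames to be detected in the video frame sequence to be detected, to obtain vectors between human body key points corresponding to the video frames to be detected and a human body key point heat map;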

determining two-dimensional coordinate information of a human body center point in the video frame to be detected according to a human body center point heat map in the human body key point heat map;

dividing the non-human body center point into a first key point and a second key point according to the connection relation between the non-human body center point and the human body center point in the human body key points;

according to the two-dimensional coordinate information of the human body center point and the vector between the first key point and the human body center point, determining the predicted coordinate information of the first key point;

determining two-dimensional coordinate information of the first key point according to the predicted coordinate information of the first key point and the first key point heat map;

determining two-dimensional coordinate information of a second key point connected with the first key point according to the two-dimensional coordinate information of the first key point, the vector between the first key point and the second key point and a second key point heat map of the second key point connected with the first key point; determining two-dimensional coordinate information of other second key points connected with the second key point according to the two-dimensional coordinate information of the second key point, the second key point heat map of other second key points connected with the second key point and vectors between the second key point and other second key points connected with the second key point; and so on, sequentially determining two-dimensional coordinate information of all remaining second key points.

2. The method of claim 1, wherein the performing human body key point detection on the video frames to be detected in the video frame sequence to be detected, to obtain vectors between human body key points corresponding to the video frames to be detected and a human body key point heat map, comprises:

inputting the video frame sequence to be detected into a first feature extraction network in a key point detection model to obtain first features corresponding to the video frames to be detected in the video frame sequence to be detected;

inputting the first features into a second feature extraction network in the key point detection model to obtain target features corresponding to the video frames to be detected;

and inputting the target features into a key point detection network in the key point detection model to obtain the vectors between human body key points corresponding to the video frames to be detected and the human body key point heat map.

3. The method of claim 2, wherein the second feature extraction network in the keypoint detection model comprises a forward feature extraction network and a reverse feature extraction network;

correspondingly, the inputting the first feature into the second feature extraction network in the key point detection model to obtain the target feature corresponding to the video frame to be detected includes:

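respectively inputting the first features into the forward feature extraction network and the reverse feature extraction network to obtain forward features and reverse features corresponding to the video frames to be detected;

and fusing the forward features and the reverse features to obtain the target features corresponding to the video frames to be detected.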

4. The method of claim 1, wherein if there are at least two human body center points, the first key point heat map of the first key point associated with each human body center point comprises at least two sub-heat maps;

correspondingly, the determining the two-dimensional coordinate information of the first key point according to the predicted coordinate information of the first key point and the first key point heat map includes:

selecting a target sub-heat map from at least two sub-heat maps of the first key point heat map associated with the human body center point according to the predicted coordinate information of the first key point associated with each human body center point;
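
The selection step recited above can be illustrated as follows. This is a minimal sketch, assuming that the target sub-heat map is the one whose peak lies closest to the predicted coordinate. The claim does not state the selection criterion, so the nearest-peak rule and the name `select_target_sub_heatmap` are hypothetical.

```python
import numpy as np


def select_target_sub_heatmap(pred_xy, sub_heatmaps):
    """Return the index of the sub-heat map whose peak is nearest to pred_xy.

    pred_xy: predicted (x, y) of the first key point for one human body center.
    sub_heatmaps: sequence of (H, W) arrays, one per candidate.
    """
    best_idx, best_dist = -1, float("inf")
    for i, hm in enumerate(sub_heatmaps):
        y, x = np.unravel_index(np.argmax(hm), hm.shape)
        dist = float(np.hypot(x - pred_xy[0], y - pred_xy[1]))
        if dist < best_dist:
            best_idx, best_dist = i, dist
    return best_idx
```

Calling this once per human body center point assigns each person the candidate response closest to that person's own prediction.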

5. The method of claim 1, further comprising:

determining three-dimensional coordinate information of the human body center point according to the two-dimensional coordinate information and the depth information of the human body center point;

determining the depth information of the non-human body center point according to the depth information of the human body center point and the vector between the human body key points;

and determining the three-dimensional coordinate information of the non-human body center point according to the two-dimensional coordinate information and the depth information of the non-human body center point.
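
The three steps recited above can be sketched as a depth propagation along the skeleton followed by a lift from 2D to 3D. This is an illustration under stated assumptions. It assumes that each vector between connected key points carries a relative depth component, and that 3D coordinates are obtained by pinhole back-projection with camera intrinsics `fx`, `fy`, `cx`, `cy`. Neither assumption is stated in the claim, and the function names are hypothetical.

```python
def propagate_depth(center_depth, depth_offsets, children, root="center"):
    """Accumulate depth from the human body center point outward.

    depth_offsets: dict (parent, child) -> relative depth along that connection.
    children:      dict name -> list of connected child names.
    """
    depths = {root: center_depth}
    stack = [root]
    while stack:
        parent = stack.pop()
        for child in children.get(parent, []):
            depths[child] = depths[parent] + depth_offsets[(parent, child)]
            stack.append(child)
    return depths


def back_project(xy, depth, fx, fy, cx, cy):
    """Lift a pixel (x, y) with known depth to camera-space (X, Y, Z).

    Assumption: a pinhole camera model with known intrinsics.
    """
    x, y = xy
    return ((x - cx) * depth / fx, (y - cy) * depth / fy, depth)
```

The same `back_project` call serves both the human body center point and the non-human body center points once their depths are known.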

6. A human body keypoint detection device comprising:

the video frame sequence acquisition module is used for acquiring a video frame sequence to be detected;

the key point detection module is used for performing human body key point detection on the video frames to be detected in the video frame sequence to be detected, to obtain vectors between human body key points and human body key point heat maps corresponding to the video frames to be detected;

a location information determination module comprising:

the first coordinate determining unit is used for determining two-dimensional coordinate information of a human body center point in the video frame to be detected according to the human body center point heat map in the human body key point heat map;

a second coordinate determination unit including:

a key point dividing sub-unit, configured to divide a non-human body center point of the human body key points into a first key point and a second key point according to a connection relationship between the non-human body center point and the human body center point;

a first coordinate determining subunit, configured to determine predicted coordinate information of a first key point according to two-dimensional coordinate information of a human body center point and a vector between the first key point and the human body center point; determining two-dimensional coordinate information of the first key point according to the predicted coordinate information of the first key point and the first key point heat map;

a second coordinate determining subunit, configured to determine two-dimensional coordinate information of a second key point connected to the first key point according to the two-dimensional coordinate information of the first key point, a vector between the first key point and the second key point, and a second key point heat map of the second key point connected to the first key point; determining two-dimensional coordinate information of other second key points connected with the second key point according to the two-dimensional coordinate information of the second key point, the second key point heat map of other second key points connected with the second key point and vectors between the second key point and other second key points connected with the second key point; and so on, sequentially determining two-dimensional coordinate information of all remaining second key points.

7. The apparatus of claim 6, wherein the keypoint detection module comprises:

the first feature determining unit is used for inputting the video frame sequence to be detected into a first feature extraction network in a key point detection model to obtain a first feature corresponding to a video frame to be detected in the video frame sequence to be detected;

and the heat map determining unit is used for inputting the target features into a key point detection network in the key point detection model to obtain the vectors between human body key points and the human body key point heat maps corresponding to the video frame to be detected.

8. The apparatus of claim 7, wherein the second feature extraction network in the keypoint detection model comprises a forward feature extraction network and a reverse feature extraction network;

9. The apparatus of claim 6, wherein if there are at least two human body center points, the first key point heat map of the first key point associated with each human body center point comprises at least two sub-heat maps;

10. The apparatus of claim 6, wherein the location information determination module further comprises:

a third coordinate determining unit, configured to determine three-dimensional coordinate information of the human body center point according to the two-dimensional coordinate information and the depth information of the human body center point;

a depth information determining unit, configured to determine depth information of the non-human body center point according to the depth information of the human body center point and a vector between the human body key points;

11. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the human keypoint detection method of any one of claims 1-5.

12. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the human body keypoint detection method according to any one of claims 1 to 5.