CN114005167A - Remote sight estimation method and device based on human skeleton key points - Google Patents

Remote sight estimation method and device based on human skeleton key points

Info

Publication number
CN114005167A
Authority
CN
China
Prior art keywords
human body
human
key points
orientation angle
face orientation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111473575.1A
Other languages
Chinese (zh)
Inventor
赵思源
彭春蕾
胡瑞敏
刘德成
苗紫民
万爽
孙飞洋
郭荟青
罗肖怡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN202111473575.1A
Publication of CN114005167A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a remote sight-line estimation method and device based on human skeleton key points. The method comprises the following steps: separating the pedestrians in an image to be detected and cropping out a human body bounding-box image; inputting the human body bounding-box image into a pre-trained human keypoint detection network model to obtain the position coordinates of a plurality of human keypoints in the bounding-box image, wherein the human keypoints comprise at least the left eye, right eye, left ear, right ear, nose, left shoulder and right shoulder; obtaining an initial face orientation angle from the position coordinates of the plurality of human keypoints; and obtaining the sight-line estimation landing-point coordinates using the initial face orientation angle. The invention can satisfactorily recognize and estimate the sight line of distant pedestrians in both real scenes and game scenes.

Description

Remote sight estimation method and device based on human skeleton key points
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a remote sight estimation method and device based on human skeleton key points.
Background
Humans can quickly and easily interpret the orientation and movement of another person's head, and use this important body language to infer the intentions of people nearby. With the continuous development of society, multi-person scenes are increasingly common and machines record ever more data; to address the safety problems raised by large volumes of images and video, the ability to rapidly infer the sight line of pedestrians in images or video is becoming more and more important.
Current sight-line detection methods can be roughly divided into two categories according to the features used for estimation. The first is based on facial keypoints: it aligns detected keypoints with a 3D head model and recovers the 3D head pose from the correspondence, so its accuracy depends on whether enough facial keypoints can be detected and matched to the corresponding 3D head model. The second is based on head features: it extracts relevant head texture features for analysis, for example using a CNN (Convolutional Neural Network) model to extract and detect head features at each pose angle, and its accuracy depends on the features the network extracts. Head pose estimation is inherently linked to gaze estimation in that it characterizes the direction of a person's gaze focus; accordingly, when the eyes are not visible (e.g., at low resolution or under occlusion), head pose provides only a coarse characterization. In computer vision, head pose estimation is most often understood as inferring the orientation of a person's head relative to the camera. The human head is generally modeled as a disembodied rigid object, under the assumption that its pose is limited to three degrees of freedom, typically pitch, roll and yaw. Under deep learning methods, the head pose angles are predicted directly from image features using a multi-loss network with one loss per angle, i.e., three separate losses, each with two components: pose-bin classification and regression. After training, the network predicts the head pose angles directly from the input image features, which can then be visualized as a head pose estimate.
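The patent text gives no code for this prior-art multi-loss network; the following minimal PyTorch sketch, in the style of the Hopenet-like architecture the paragraph appears to describe, shows one loss per angle with a classification component over angle bins plus a regression component on the expected angle. The 66 bins of 3 degrees, the feature dimension and the weight alpha are illustrative assumptions, not values from the patent.

```python
# Illustrative sketch only, not from the patent: Hopenet-style multi-loss
# head-pose prediction with one output branch per angle (yaw, pitch, roll).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLossHead(nn.Module):
    def __init__(self, feat_dim: int = 2048, num_bins: int = 66):  # assumed sizes
        super().__init__()
        self.fc_yaw = nn.Linear(feat_dim, num_bins)
        self.fc_pitch = nn.Linear(feat_dim, num_bins)
        self.fc_roll = nn.Linear(feat_dim, num_bins)

    def forward(self, feats):
        # feats: (batch, feat_dim) image features from a CNN backbone
        return self.fc_yaw(feats), self.fc_pitch(feats), self.fc_roll(feats)

def angle_loss(logits, bin_labels, angle_targets, alpha: float = 0.001):
    """Per-angle loss: cross-entropy on the binned pose plus MSE on the
    expected angle recovered from the bin probabilities (66 bins of 3 deg)."""
    ce = F.cross_entropy(logits, bin_labels)
    probs = F.softmax(logits, dim=1)
    idx = torch.arange(logits.size(1), dtype=probs.dtype, device=probs.device)
    pred_angle = torch.sum(probs * idx, dim=1) * 3 - 99  # bin index -> degrees
    return ce + alpha * F.mse_loss(pred_angle, angle_targets)
```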
However, when environmental conditions or video quality degrade, the accuracy of both the facial-keypoint and head-feature methods drops. For example, when the face appears blurred at long distance and low resolution, the first method cannot detect the face effectively: far too few keypoints are found to align with the 3D head model, so the gaze focus of the head cannot be recovered. Moreover, the 3D head model is a fixed template, and when the detected keypoints deviate strongly from it, the accuracy of head gaze estimation also drops. For the second method, facial features become severely deficient: head pose estimation fails for pedestrians in long-distance, low-resolution scenes because part of the head's features disappear and the features of different pose angles can no longer be distinguished, resulting in low overall accuracy.
Disclosure of Invention
In order to solve the above problems in the prior art, the invention provides a remote sight-line estimation method and device based on human skeleton key points. The technical problem to be solved by the invention is realized by the following technical scheme:
One aspect of the present invention provides a remote sight-line estimation method based on human skeleton key points, comprising:
S1: separating the pedestrians in an image to be detected and cropping out a human body bounding-box image;
S2: inputting the human body bounding-box image into a pre-trained human keypoint detection network model to obtain the position coordinates of a plurality of human keypoints in the bounding-box image, wherein the human keypoints comprise at least the left eye, right eye, left ear, right ear, nose, left shoulder and right shoulder;
S3: obtaining an initial face orientation angle according to the position coordinates of the plurality of human keypoints;
S4: obtaining the sight-line estimation landing-point coordinates using the initial face orientation angle.
In an embodiment of the present invention, S1 comprises:
separating all pedestrians in the image to be detected using a human body detector, cropping out the human body bounding-box image of each pedestrian, and obtaining the upper-left and lower-right corner coordinates of the bounding-box image.
In an embodiment of the present invention, S3 comprises:
S31: extracting the position coordinates of the human keypoints, comprising: left eye (x1, y1), right eye (x2, y2), left ear (x3, y3), right ear (x4, y4), nose (x5, y5), left shoulder (x6, y6) and right shoulder (x7, y7);
S32: selecting a coordinate origin (x0, y0) in the human body bounding-box image according to the position coordinates of the human keypoints;
S33: obtaining the initial face orientation angle using the coordinate origin (x0, y0):
headangle1 = arctan(y0 / x0).
In an embodiment of the present invention, S32 comprises:
S321: calculating the distance l_dist between the left ear and the left eye, the distance r_dist between the right ear and the right eye, and the midpoint coordinates ms of the left and right shoulders;
S322: comparing l_dist and r_dist and choosing the coordinate origin (x0, y0): if l_dist > r_dist, then x0 = ms_x - x5 and y0 = y3 - y5; if l_dist < r_dist, then x0 = ms_x - x5 and y0 = y4 - y5, where ms_x denotes the abscissa of the midpoint of the left and right shoulders.
In an embodiment of the present invention, S4 comprises:
S41: equally dividing the face turning range into a plurality of angle ranges;
S42: determining which angle range the initial face orientation angle headangle1 falls into, and selecting the median of that range as the final face orientation angle headangle2;
S43: obtaining the sight-line estimation landing-point coordinates of the human body according to the selected final face orientation angle.
In an embodiment of the present invention, S43 comprises:
obtaining, from the selected final face orientation angle headangle2 and a preset visual sight-line length L, the relative coordinates (x8, y8) of the sight-line estimation landing point with the nose as origin, calculated as:
x8 = L·cos(headangle2), y8 = L·sin(headangle2);
and calculating the sight-line estimation landing-point coordinates (x, y) from the relative coordinates (x8, y8):
x = x5 + x8, y = y5 + y8.
Another aspect of the present invention provides a remote sight-line estimation device based on human skeleton key points, comprising:
a human body detector for separating the pedestrians in an image to be detected and cropping out a human body bounding-box image;
a human keypoint detection module for inputting the human body bounding-box image into a pre-trained human keypoint detection network model to obtain the position coordinates of a plurality of human keypoints in the bounding-box image, wherein the human keypoints comprise at least the left eye, right eye, left ear, right ear, nose, left shoulder and right shoulder;
a face orientation angle acquisition module for obtaining an initial face orientation angle according to the position coordinates of the plurality of human keypoints; and
a remote sight-line estimation module for obtaining the sight-line estimation landing-point coordinates using the initial face orientation angle.
In an embodiment of the present invention, the human keypoint detection module includes the pre-trained human keypoint detection network model.
In an embodiment of the present invention, the face orientation angle acquisition module is specifically configured to:
extract the position coordinates of the human keypoints; select a coordinate origin in the human body bounding-box image according to those position coordinates; and obtain the initial face orientation angle using the coordinate origin.
In an embodiment of the present invention, the remote sight-line estimation module is specifically configured to:
equally divide the face turning range into a plurality of angle ranges; determine which angle range the initial face orientation angle falls into, and select the median of that range as the final face orientation angle; and obtain the sight-line estimation landing-point coordinates of the human body according to the selected final face orientation angle.
Compared with the prior art, the invention has the following beneficial effects:
The remote sight-line estimation device based on human skeleton key points can analyze distant pedestrians by adding operations on skeleton keypoints: for example, a pedestrian's trajectory can be inferred from different head poses, and the pedestrian's attention to different areas can be judged from the head orientation, so the sight line of distant pedestrians can be satisfactorily recognized and estimated in both real scenes and game scenes. Secondly, because a pre-trained human keypoint detection network model is used, the method involves no network training during execution; it therefore consumes few computing resources, is suitable for a wide range of computers, and achieves real-time performance on test videos.
The present invention will be described in further detail with reference to the accompanying drawings and examples.
Drawings
FIG. 1 is a flowchart of a remote sight-line estimation method based on human skeleton key points according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a human body detector separating all pedestrians in an image to be detected according to an embodiment of the present invention;
FIG. 3 is a diagram of a cropped human body bounding-box image according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of human skeleton key points according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of detecting human skeleton key points in a human body bounding-box image according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of the face orientation angle calculation process according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of the gaze focus calculation process provided by an embodiment of the invention;
FIG. 8 is a block diagram of a remote sight-line estimation device based on human skeleton key points according to an embodiment of the present invention;
FIG. 9 is a violin plot of the results on the three video sources provided by an embodiment of the present invention.
Detailed Description
In order to further explain the technical means and effects adopted by the present invention to achieve its intended purpose, the remote sight-line estimation method and device based on human skeleton key points are described in detail below with reference to the accompanying drawings and specific embodiments.
The foregoing and other technical matters, features and effects of the present invention will be apparent from the following detailed description of the embodiments, read in conjunction with the accompanying drawings. The specific embodiments allow the technical means and effects of the invention to be understood more deeply and concretely; however, the attached drawings are provided for reference and illustration only and are not intended to limit the technical scheme of the present invention.
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that an article or device that comprises a list of elements includes not only those elements but may also include other elements not expressly listed. Without further limitation, an element preceded by the phrase "comprising a …" does not exclude the presence of additional identical elements in the article or device that comprises it.
Example one
Referring to fig. 1, fig. 1 is a flowchart of a remote sight-line estimation method based on human skeleton key points according to an embodiment of the present invention. As shown in the figure, the method of this embodiment includes:
S1: separating the pedestrians in the image to be detected and cropping out the human body bounding-box image.
Specifically, all pedestrians in the image to be detected are separated by a human body detector, yielding the bounding box of each pedestrian together with its upper-left and lower-right corner coordinates; as shown in fig. 2, these two corner coordinates delimit the top, left, bottom and right boundaries of the bounding box. The human body bounding-box image of each pedestrian is then cropped out of the image using the upper-left and lower-right corner coordinates, as shown in fig. 3. The position coordinates in this embodiment are the pixel positions of the respective points.
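The patent does not name a specific human body detector. As a hedged illustration of step S1, the sketch below uses torchvision's off-the-shelf Faster R-CNN as a stand-in detector and crops one bounding-box image per detected person; the detector choice and score threshold are assumptions, not part of the patent.

```python
# Illustrative sketch of step S1; detector and threshold are assumptions.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def crop_pedestrians(image: Image.Image, score_thresh: float = 0.8):
    """Return (top-left, bottom-right, cropped image) per detected pedestrian."""
    with torch.no_grad():
        pred = detector([to_tensor(image)])[0]
    crops = []
    for box, label, score in zip(pred["boxes"], pred["labels"], pred["scores"]):
        if label.item() == 1 and score.item() >= score_thresh:  # COCO class 1: person
            x1, y1, x2, y2 = (int(v) for v in box.tolist())
            crops.append(((x1, y1), (x2, y2), image.crop((x1, y1, x2, y2))))
    return crops
```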
S2: inputting the human body bounding-box image into a pre-trained human keypoint detection network model to obtain the position coordinates of a plurality of human keypoints in the bounding-box image, wherein the human keypoints comprise at least the left eye, right eye, left ear, right ear, nose, left shoulder and right shoulder.
In this embodiment, the AlphaPose network is selected as the human keypoint detection network model. AlphaPose is an existing neural network for multi-person pose recognition that predicts the positions of human keypoints; it can predict multiple groups of keypoints, each group with a score, and the highest-scoring group is taken as the final keypoint coordinate output, so a detailed description is omitted here. It should be noted that the model outputs the 17 human skeleton keypoints shown in fig. 4, namely: left and right eyes, nose, left and right ears, left and right shoulders, left and right elbows, left and right hands, left and right hips, left and right knees, and left and right ankles. The keypoints used in this embodiment of the invention are the left eye, right eye, left ear, right ear, nose, left shoulder and right shoulder. When some keypoints are unclear or occluded, the model can also estimate the position coordinates of the missing keypoints as needed from the positions and coordinates of the known ones. In other embodiments, any other detection network capable of producing the human keypoints may be used, which is not limited herein.
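Since AlphaPose follows the standard 17-keypoint COCO convention, the seven keypoints used by the method can be picked out by index. The helper below is a small sketch under that assumption; `select_keypoints` is a name introduced here for illustration and is not part of the AlphaPose API.

```python
# Standard COCO keypoint indices (AlphaPose follows this convention).
COCO_IDX = {"nose": 0, "left_eye": 1, "right_eye": 2, "left_ear": 3,
            "right_ear": 4, "left_shoulder": 5, "right_shoulder": 6}

def select_keypoints(keypoints):
    """keypoints: (17, 2) array of (x, y) pixel coordinates for the
    highest-scoring pose group; returns the seven points used by the method."""
    return {name: (float(keypoints[i][0]), float(keypoints[i][1]))
            for name, i in COCO_IDX.items()}
```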
S3: obtaining an initial face orientation angle from the position coordinates of the plurality of human keypoints.
In this embodiment, step S3 specifically includes:
S31: extracting from the human keypoint detection network model the position coordinates of the human keypoints in the current bounding-box image, comprising: left eye (x1, y1), right eye (x2, y2), left ear (x3, y3), right ear (x4, y4), nose (x5, y5), left shoulder (x6, y6) and right shoulder (x7, y7);
S32: selecting a coordinate origin (x0, y0) in the human body bounding-box image according to the position coordinates of the human keypoints.
Specifically, S32 includes:
S321: calculating the distance l_dist between the left ear and the left eye, the distance r_dist between the right ear and the right eye (both computed as squared Euclidean distances, which suffices since they are only compared with each other), and the midpoint coordinates ms of the left and right shoulders:
l_dist = (x1 - x3)^2 + (y1 - y3)^2
r_dist = (x2 - x4)^2 + (y2 - y4)^2
ms = ((x6 + x7)/2, (y6 + y7)/2)
S322: comparing l_dist and r_dist and choosing the coordinate origin (x0, y0), where x0 is selected from the shoulder-midpoint and nose abscissas, and y0 from the ear and nose ordinates: if l_dist > r_dist, then x0 = ms_x - x5 and y0 = y3 - y5; if l_dist < r_dist, then x0 = ms_x - x5 and y0 = y4 - y5, where ms_x denotes the abscissa of the shoulder midpoint. This yields the coordinate origin O(x0, y0), as shown in fig. 6.
S33: obtaining the initial face orientation angle using the coordinate origin (x0, y0):
headangle1 = arctan(y0 / x0).
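Steps S31 to S33 translate directly into code. The sketch below follows the formulas above; the only deliberate deviation is the use of `atan2` instead of the text's arctan(y0/x0), an assumption made so the angle covers the full [-180, 180] degree range used in step S41. The `kp` dict matches the hypothetical `select_keypoints` helper shown earlier.

```python
import math

def initial_face_angle(kp) -> float:
    """Sketch of S31-S33: coordinate origin selection and initial angle."""
    x1, y1 = kp["left_eye"];  x2, y2 = kp["right_eye"]
    x3, y3 = kp["left_ear"];  x4, y4 = kp["right_ear"]
    x5, y5 = kp["nose"]
    x6, _ = kp["left_shoulder"]; x7, _ = kp["right_shoulder"]

    # S321: squared ear-to-eye distances (only compared, so no sqrt needed)
    # and the abscissa of the shoulder midpoint.
    l_dist = (x1 - x3) ** 2 + (y1 - y3) ** 2
    r_dist = (x2 - x4) ** 2 + (y2 - y4) ** 2
    ms_x = (x6 + x7) / 2

    # S322: pick the origin using the more visible ear side.
    x0 = ms_x - x5
    y0 = (y3 - y5) if l_dist > r_dist else (y4 - y5)

    # S33: initial face orientation angle in degrees (atan2 is an assumption
    # replacing arctan(y0/x0) to cover the full angular range).
    return math.degrees(math.atan2(y0, x0))
```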
S4: obtaining the sight-line estimation landing-point coordinates using the initial face orientation angle.
In this embodiment, S4 includes:
S41: equally dividing the face turning range into a plurality of angle ranges.
Specifically, the 360-degree head turning range is equally divided into 12 intervals of 30 degrees each: [-180, -150], [-150, -120], [-120, -90], [-90, -60], [-60, -30], [-30, 0], [0, 30], [30, 60], [60, 90], [90, 120], [120, 150] and [150, 180]; it is then determined which of these ranges the initial face orientation angle falls into.
S42: determining which angle range the initial face orientation angle headangle1 falls into, and selecting the median of that range as the final face orientation angle headangle2.
If the pedestrian in the image to be detected is far away, the irregular rotation of the head becomes a significant factor: using the computed initial face orientation angle headangle1 directly would make the orientation angle change too much between video frames, which harms the visual impression. In addition, very high angular accuracy is not required for distant pedestrians, so this embodiment adjusts the face orientation angle with a median method.
Specifically, after dividing the sight range, the two-dimensional coordinates of a specific gaze focus are obtained: if the initial face orientation angle headangle1 falls into a certain range, the middle angle of that range is selected as the final face orientation angle headangle2. For example, if headangle1 computed in step S33 is 12 degrees, it falls into the [0, 30] degree range, and the middle angle of that range, namely 15 degrees, is taken as the head orientation angle.
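As a sketch of steps S41 and S42, the quantization can be written as snapping the angle to the midpoint of its 30-degree bin; this reproduces the 12-degree-to-15-degree example above.

```python
import math

def quantize_angle(angle_deg: float, bin_width: float = 30.0) -> float:
    """S41-S42 sketch: return the midpoint of the bin containing angle_deg.
    Bins tile [-180, 180); an input of exactly 180 would need special-casing."""
    k = math.floor(angle_deg / bin_width)       # index of bin [k*w, (k+1)*w)
    return k * bin_width + bin_width / 2.0      # bin midpoint

# quantize_angle(12.0) -> 15.0, quantize_angle(-37.0) -> -45.0
```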
S43: obtaining the sight-line estimation landing-point coordinates of the human body according to the selected final face orientation angle.
From the selected final face orientation angle headangle2 and the preset visual sight-line length L, the relative coordinates (x8, y8) of the sight-line estimation landing point with the nose as origin are obtained, as shown in fig. 7, calculated as:
x8 = L·cos(headangle2), y8 = L·sin(headangle2)
Preferably, L = 20. Then the sight-line estimation landing-point coordinates (x, y) are calculated from the relative coordinates (x8, y8):
x = x5 + x8, y = y5 + y8
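Step S43 then reduces to projecting a segment of length L from the nose. The sketch below assumes the (L·cos, L·sin) reconstruction of the patent's image-only formula, with L in image pixels.

```python
import math

def gaze_landing_point(nose_xy, headangle2_deg: float, L: float = 20.0):
    """S43 sketch: landing point = nose + L * (cos, sin) of the final angle.
    L = 20 follows the 'preferably, L = 20' setting in the text."""
    x5, y5 = nose_xy
    x8 = L * math.cos(math.radians(headangle2_deg))
    y8 = L * math.sin(math.radians(headangle2_deg))
    return x5 + x8, y5 + y8
```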
the remote sight line estimation device based on the human skeleton key points can analyze the remote pedestrians due to the addition of the operation of the skeleton key points, for example, the action tracks of the pedestrians can be deduced through different head postures, the attention of the pedestrians to different areas can be judged due to the head orientations, and the remote sight line of the pedestrians can be satisfactorily recognized and estimated in real scenes and game scenes. Secondly, because the network model is detected by using the human body key points which are pre-trained, the method does not relate to the training of the network model in the execution process, has the advantages of low consumption of computing resources and suitability for various computers, and the test video can achieve the real-time effect.
Example two
On the basis of the above embodiment, this embodiment provides a remote sight-line estimation device based on human skeleton key points. As shown in fig. 8, the device comprises a human body detector 1, a human keypoint detection module 2, a face orientation angle acquisition module 3 and a remote sight-line estimation module 4. The human body detector 1 is used to separate the pedestrians in the image to be detected and crop out the human body bounding-box images. Specifically, the human body detector 1 separates all pedestrians in the image to be detected, obtains the bounding box of each pedestrian together with its upper-left and lower-right corner coordinates, and then crops the bounding-box image of each pedestrian out of the image using those corner coordinates.
The human keypoint detection module 2 is configured to input the human body bounding-box image into a pre-trained human keypoint detection network model to obtain the position coordinates of a plurality of human keypoints in the bounding-box image, the keypoints comprising at least the left eye, right eye, left ear, right ear, nose, left shoulder and right shoulder. The module includes the pre-trained human keypoint detection network model; in this embodiment, the model is the AlphaPose network.
The face orientation angle acquisition module 3 is configured to obtain an initial face orientation angle according to the position coordinates of the plurality of human keypoints.
Further, the face orientation angle acquisition module 3 is specifically configured to: extract the position coordinates of the human keypoints; select a coordinate origin in the human body bounding-box image according to those position coordinates; and obtain the initial face orientation angle using the coordinate origin.
The remote sight-line estimation module 4 is configured to obtain the sight-line estimation landing-point coordinates using the initial face orientation angle. Further, the remote sight-line estimation module 4 is specifically configured to: equally divide the face turning range into a plurality of angle ranges; determine which angle range the initial face orientation angle falls into, and select the median of that range as the final face orientation angle; and obtain the sight-line estimation landing-point coordinates of the human body according to the selected final face orientation angle.
The remote sight-line estimation method based on human skeleton key points provided by the present invention is verified and illustrated below through simulation experiments.
(1) Simulation conditions
To verify the effect of the above method, this embodiment uses datasets covering multiple viewing angles and multiple recording environments. The viewing angles include head-up and look-down views; the recording environments include a hand-held camera (the camera moves with a person), a vehicle-mounted camera (the camera moves with a car) and a fixed camera; and the scenes include manually collected real-scene data and game-scene data. The dataset contains M video frames, where M is a natural number greater than 0.
The simulation is implemented with PyTorch 1.7. The data comprise the real-scene source dataset MOT17, some scene videos from the in-game surveillance dataset MTA, and surveillance-view videos used in news reports published on Douyin.
The results of the simulation experiments were evaluated by questionnaire to obtain the final performance figures. Three test video sources were used in the questionnaire: Douyin, the MTA public dataset and the MOT public dataset, each contributing 5 short videos of approximately 10 seconds each. Evaluation index: the performance of distant-pedestrian sight-line estimation in each video is scored from 0 to 10, where a higher score indicates better performance and higher accuracy. The questionnaire respondents were experts and researchers with a scientific background in the field.
Referring to table 1, the average scores of the three video sources are 7.38, 7.12 and 7.51, the standard deviations are 1.27, 1.36 and 1.06, and the variances are 1.59, 1.82 and 1.11, respectively.
TABLE 1 Scoring results for the three video sources

Video source    Average score    Standard deviation    Variance
Douyin          7.38             1.27                  1.59
MTA             7.12             1.36                  1.82
MOT             7.51             1.06                  1.11
Referring to fig. 9, the results for the three video sources are visualized as violin plots for analysis; the median, confidence interval and interquartile range all reach excellent levels.
The experimental results show that the remote sight-line estimation method based on human skeleton key points achieves satisfactory recognition performance for distant-pedestrian sight-line estimation in both real scenes and game scenes.
In the embodiments provided by the present invention, it should be understood that the disclosed device and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into modules is only a logical division, and in actual implementation there may be other divisions; multiple modules or components may be combined or integrated into another system, or some features may be omitted or not implemented.
In addition, the functional modules in the embodiments of the present invention may be integrated into one processing module, or each module may exist physically alone, or two or more modules may be integrated into one module. The integrated module may be implemented in hardware, or in a combination of hardware and software functional modules.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments, and the specific implementation of the invention is not to be considered limited to these descriptions. Those skilled in the art to which the invention pertains may make several simple deductions or substitutions without departing from the concept of the invention, and all of these shall be considered to fall within the protection scope of the invention.

Claims (10)

1. A remote sight-line estimation method based on human skeleton key points, characterized by comprising the following steps:
S1: separating the pedestrians in an image to be detected and cropping out a human body bounding-box image;
S2: inputting the human body bounding-box image into a pre-trained human keypoint detection network model to obtain the position coordinates of a plurality of human keypoints in the bounding-box image, wherein the human keypoints comprise at least the left eye, right eye, left ear, right ear, nose, left shoulder and right shoulder;
S3: obtaining an initial face orientation angle according to the position coordinates of the plurality of human keypoints;
S4: obtaining the sight-line estimation landing-point coordinates using the initial face orientation angle.
2. The remote sight-line estimation method based on human skeleton key points according to claim 1, wherein S1 comprises:
separating all pedestrians in the image to be detected using a human body detector, cropping out the human body bounding-box image of each pedestrian, and obtaining the upper-left and lower-right corner coordinates of the bounding-box image.
3. The remote sight-line estimation method based on human skeleton key points according to claim 1, wherein S3 comprises:
S31: extracting the position coordinates of the human keypoints, comprising: left eye (x1, y1), right eye (x2, y2), left ear (x3, y3), right ear (x4, y4), nose (x5, y5), left shoulder (x6, y6) and right shoulder (x7, y7);
S32: selecting a coordinate origin (x0, y0) in the human body bounding-box image according to the position coordinates of the human keypoints;
S33: obtaining the initial face orientation angle using the coordinate origin (x0, y0):
headangle1 = arctan(y0 / x0).
4. The remote sight-line estimation method based on human skeleton key points according to claim 3, wherein S32 comprises:
S321: calculating the distance l_dist between the left ear and the left eye, the distance r_dist between the right ear and the right eye, and the midpoint coordinates ms of the left and right shoulders;
S322: comparing l_dist and r_dist and choosing the coordinate origin (x0, y0): if l_dist > r_dist, then x0 = ms_x - x5 and y0 = y3 - y5; if l_dist < r_dist, then x0 = ms_x - x5 and y0 = y4 - y5, where ms_x denotes the abscissa of the midpoint of the left and right shoulders.
5. The remote sight-line estimation method based on human skeleton key points according to claim 3, wherein S4 comprises:
S41: equally dividing the face turning range into a plurality of angle ranges;
S42: determining which angle range the initial face orientation angle headangle1 falls into, and selecting the median of that range as the final face orientation angle headangle2;
S43: obtaining the sight-line estimation landing-point coordinates of the human body according to the selected final face orientation angle.
6. The remote sight-line estimation method based on human skeleton key points according to claim 5, wherein S43 comprises:
obtaining, from the selected final face orientation angle headangle2 and a preset visual sight-line length L, the relative coordinates (x8, y8) of the sight-line estimation landing point with the nose as origin, calculated as:
x8 = L·cos(headangle2), y8 = L·sin(headangle2);
and calculating the sight-line estimation landing-point coordinates (x, y) from the relative coordinates (x8, y8):
x = x5 + x8, y = y5 + y8.
7. A remote sight-line estimation device based on human skeleton key points, characterized by comprising:
a human body detector for separating the pedestrians in an image to be detected and cropping out a human body bounding-box image;
a human keypoint detection module for inputting the human body bounding-box image into a pre-trained human keypoint detection network model to obtain the position coordinates of a plurality of human keypoints in the bounding-box image, wherein the human keypoints comprise at least the left eye, right eye, left ear, right ear, nose, left shoulder and right shoulder;
a face orientation angle acquisition module for obtaining an initial face orientation angle according to the position coordinates of the plurality of human keypoints; and
a remote sight-line estimation module for obtaining the sight-line estimation landing-point coordinates using the initial face orientation angle.
8. The remote sight-line estimation device based on human skeleton key points according to claim 7, wherein the human keypoint detection module includes a pre-trained human keypoint detection network model.
9. The remote sight-line estimation device based on human skeleton key points according to claim 7, wherein the face orientation angle acquisition module is specifically configured to:
extract the position coordinates of the human keypoints; select a coordinate origin in the human body bounding-box image according to those position coordinates; and obtain the initial face orientation angle using the coordinate origin.
10. The remote sight-line estimation device based on human skeleton key points according to any one of claims 7 to 9, wherein the remote sight-line estimation module is specifically configured to:
equally divide the face turning range into a plurality of angle ranges; determine which angle range the initial face orientation angle falls into, and select the median of that range as the final face orientation angle; and obtain the sight-line estimation landing-point coordinates of the human body according to the selected final face orientation angle.
CN202111473575.1A 2021-11-29 2021-11-29 Remote sight estimation method and device based on human skeleton key points Pending CN114005167A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111473575.1A CN114005167A (en) 2021-11-29 2021-11-29 Remote sight estimation method and device based on human skeleton key points

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111473575.1A CN114005167A (en) 2021-11-29 2021-11-29 Remote sight estimation method and device based on human skeleton key points

Publications (1)

Publication Number Publication Date
CN114005167A true CN114005167A (en) 2022-02-01

Family

ID=79931267

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111473575.1A Pending CN114005167A (en) 2021-11-29 2021-11-29 Remote sight estimation method and device based on human skeleton key points

Country Status (1)

Country Link
CN (1) CN114005167A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114542874A (en) * 2022-02-23 2022-05-27 常州工业职业技术学院 Device for automatically adjusting photographing height and angle and control system thereof
CN114529715A (en) * 2022-04-22 2022-05-24 中科南京智能技术研究院 Image identification method and system based on edge extraction
CN114529715B (en) * 2022-04-22 2022-07-19 中科南京智能技术研究院 Image identification method and system based on edge extraction
CN115631464A (en) * 2022-11-17 2023-01-20 北京航空航天大学 Pedestrian three-dimensional representation method oriented to large space-time target association
CN117238039A (en) * 2023-11-16 2023-12-15 暗物智能科技(广州)有限公司 Multitasking human behavior analysis method and system based on top view angle
CN117238039B (en) * 2023-11-16 2024-03-19 暗物智能科技(广州)有限公司 Multitasking human behavior analysis method and system based on top view angle

Similar Documents

Publication Publication Date Title
CN114005167A (en) Remote sight estimation method and device based on human skeleton key points
Jain et al. Real-time upper-body human pose estimation using a depth camera
Harville et al. Fast, integrated person tracking and activity recognition with plan-view templates from a single stereo camera
CN102831439B (en) Gesture tracking method and system
EP1320830B1 (en) Facial image processing system
US8462996B2 (en) Method and system for measuring human response to visual stimulus based on changes in facial expression
MX2013002904A (en) Person image processing apparatus and person image processing method.
CN109766796B (en) Deep pedestrian detection method for dense crowd
CN105809144A (en) Gesture recognition system and method adopting action segmentation
CN110837784A (en) Examination room peeping cheating detection system based on human head characteristics
CN111460976B (en) Data-driven real-time hand motion assessment method based on RGB video
CN111209811B (en) Method and system for detecting eyeball attention position in real time
WO2021068781A1 (en) Fatigue state identification method, apparatus and device
JP2010104754A (en) Emotion analyzer
CN111881749A (en) Bidirectional pedestrian flow statistical method based on RGB-D multi-modal data
CN112417142A (en) Auxiliary method and system for generating word meaning and abstract based on eye movement tracking
CN112926522A (en) Behavior identification method based on skeleton attitude and space-time diagram convolutional network
US20220036056A1 (en) Image processing apparatus and method for recognizing state of subject
Gu et al. Hand gesture interface based on improved adaptive hand area detection and contour signature
CN117593792A (en) Abnormal gesture detection method and device based on video frame
CN115841497B (en) Boundary detection method and escalator area intrusion detection method and system
CN111694980A (en) Robust family child learning state visual supervision method and device
JPH0991432A (en) Method for extracting doubtful person
CN111507192A (en) Appearance instrument monitoring method and device
Miyoshi et al. Detection of Dangerous Behavior by Estimation of Head Pose and Moving Direction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination