CN113366491A - Eyeball tracking method, device and storage medium - Google Patents

Eyeball tracking method, device and storage medium

Info

Publication number
CN113366491A
CN113366491A (application CN202180001560.7A)
Authority
CN
China
Prior art keywords
target
user
sample
face
depth image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202180001560.7A
Other languages
Chinese (zh)
Other versions
CN113366491B (en)
Inventor
袁麓
张国华
张代齐
郑爽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Publication of CN113366491A
Application granted
Publication of CN113366491B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/013Eye tracking input arrangements

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

An embodiment of the present application provides an eyeball tracking method, device and storage medium, including: preprocessing a grayscale image and a depth image to obtain a gray-depth image of a target in a preset coordinate system; performing head detection on the gray-depth image of the target to obtain a gray-depth image of the head of the target; performing face reconstruction processing on the gray-depth image of the head of the target to obtain face information of the target; and obtaining the pupil position of the target according to the face information. In this scheme, a point cloud of the target is obtained based on the grayscale image and the depth image of the target, head detection yields the point cloud of the head of the target, and face reconstruction processing on the point cloud of the head yields the pupil position of the target. With this method, the target's face is reconstructed from the two modalities of grayscale and depth information, and an accurate sight line starting point can be obtained in real time.

Description

Eyeball tracking method, device and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to an eyeball tracking method, an eyeball tracking device, and a storage medium.
Background
Sight line (gaze) estimation is an important technology for understanding human intention in human-computer interaction, and can be applied to scenarios such as game interaction, medical diagnosis (psychological disorders), and analysis of driver intention in a cockpit. The sight line starting point (namely the eyeball position) and the sight line direction are the two key modules of sight line estimation; combined with three-dimensional modeling of the scene environment, they yield the point of regard (PoR) of the user's sight line, so that the user's intention can be understood more accurately and the interaction completed.
Currently, when the eyeball position is determined with a monocular camera, the position of the sight line starting point in three-dimensional space is estimated by using priors and a camera imaging model to infer the distance between the human eye and the camera. With this technique, the depth error at a normal driving distance is 2-3 centimeters (cm), which cannot meet scenarios with higher precision requirements, such as waking the central control screen in a vehicle-mounted scenario. Moreover, a 2-3 cm error in the starting point causes a large error of the predicted PoR in the corresponding direction; in particular, the farther the gazed object is from the user, the larger the deviation between the intersection of the sight line direction with the object and the true value, so the requirement of interaction between the sight line and objects outside the vehicle cannot be met.
Another current approach determines the eyeball position with a depth sensor: an optimization-based face reconstruction is first performed offline using depth data, and at deployment time an iterative closest point (ICP) algorithm registers the reconstructed face model against point cloud data acquired in real time to obtain the current 6-degree-of-freedom pose of the face and thus the three-dimensional position of the eyeball. With this technique, offline registration is needed to obtain the face mesh information of the user, and the ICP registration error grows as the amplitude of the facial expression change increases. Therefore, the prior art cannot cope with open environments or actual vehicle-mounted scenarios.
Disclosure of Invention
The embodiment of the application provides an eyeball tracking method, an eyeball tracking device and a storage medium, so that the eyeball tracking precision is improved.
In a first aspect, an embodiment of the present application provides an eyeball tracking method, including: preprocessing a gray image and a depth image to obtain a gray-depth image of a target under a preset coordinate system, wherein the gray image and the depth image both contain head information of the target; performing human head detection on the gray-depth image of the target to obtain a gray-depth image of the head of the target; carrying out face reconstruction processing on the gray level-depth image of the head of the target to obtain face information of the target; and obtaining the pupil position of the target according to the face information.
In the embodiments of the present application, the gray-depth image of the target is obtained based on the grayscale image and the depth image of the target, head detection yields the gray-depth image of the head of the target, and face reconstruction processing is performed on the gray-depth image of the head to obtain the pupil position of the target. With this method, the target's face is reconstructed from the two modalities of grayscale and depth information, and an accurate sight line starting point can be obtained in real time.
As an optional implementation manner, the performing face reconstruction processing on the gray-level depth image of the head of the target to obtain face information of the target includes: performing feature extraction on the gray level-depth image of the head of the target to obtain a gray level feature and a depth feature of the target; fusing the gray level feature and the depth feature of the target to obtain a human face model parameter of the target; and obtaining the face information of the target according to the face model parameters of the target.
The face model parameters of the target are obtained by fusing the grayscale feature and the depth feature of the target, and the face information of the target is then obtained from these parameters. Because the face model parameters fuse both grayscale and depth features, they are more comprehensive than prior-art parameters derived from grayscale features alone, which can effectively improve the eyeball tracking precision.
As an alternative implementation, the face reconstruction processing on the gray-scale depth image of the head of the target is processed through a face reconstruction network model.
As an optional implementation manner, the face reconstruction network model is obtained by training as follows: respectively extracting the characteristics of a user gray level image sample and a user depth image sample which are input into a face reconstruction network model to obtain the gray level characteristics and the depth characteristics of the user; fusing the gray level features and the depth features of the user to obtain face model parameters of the user, wherein the face model parameters comprise identity parameters, expression parameters, texture parameters, rotation parameters and displacement parameters; obtaining face information according to the face model parameters of the user; and obtaining a loss value according to the face information, if the loss value does not reach a stop condition, adjusting parameters of the face reconstruction network model, and repeatedly executing the steps until the stop condition is reached to obtain the trained face reconstruction network model, wherein the weight of the user eyes in a first loss function corresponding to the loss value is not less than a preset threshold value. The stop condition may be that the loss value is not greater than a preset value.
As another optional implementation, the method further includes: acquiring a first point cloud sample of the user, and a point cloud sample and a texture sample of an occluder; superimposing the point cloud sample of the occluder on the first point cloud sample of the user to obtain a second point cloud sample of the user; blanking the second point cloud sample of the user to obtain a third point cloud sample of the user; rendering the third point cloud sample and the texture sample of the occluder to obtain a two-dimensional image sample of the user; and performing noise-adding enhancement processing on the two-dimensional image sample of the user and on the third point cloud sample, respectively, to obtain an enhanced two-dimensional image sample and an enhanced depth image sample of the user, where the enhanced two-dimensional image sample and the enhanced depth image sample of the user are respectively the user grayscale image sample and the user depth image sample input to the face reconstruction network model.
In the embodiments of the present application, the point cloud sample of the user and the point cloud sample and texture sample of the occluder are acquired, and the presence of an occluder is simulated, so that a face reconstruction network model that can adapt to occluders is obtained through training. With this scheme, stronger robustness to objects occluding the eyes is achieved, and the data of the eye region is enhanced, so the reconstruction precision of the eye region is higher. In this way, situations that can occur in various real scenes can be simulated and the corresponding enhanced two-dimensional and three-dimensional images obtained, which improves the robustness of the algorithm.
In a second aspect, an embodiment of the present application provides an eyeball tracking device, including a preprocessing module, a detection module, a reconstruction processing module and an acquisition module, wherein the preprocessing module is configured to preprocess a grayscale image and a depth image to obtain a gray-depth image of a target in a preset coordinate system, the grayscale image and the depth image both containing head information of the target; the detection module is configured to perform head detection on the gray-depth image of the target to obtain a gray-depth image of the head of the target; the reconstruction processing module is configured to perform face reconstruction processing on the gray-depth image of the head of the target to obtain face information of the target; and the acquisition module is configured to obtain the pupil position of the target according to the face information.
As an optional implementation manner, the reconstruction processing module is configured to: performing feature extraction on the gray level-depth image of the head of the target to obtain a gray level feature and a depth feature of the target; fusing the gray level feature and the depth feature of the target to obtain a human face model parameter of the target; and obtaining the face information of the target according to the face model parameters of the target.
As an alternative implementation, the face reconstruction processing on the gray-scale depth image of the head of the target is processed through a face reconstruction network model.
As an optional implementation manner, the face reconstruction network model is obtained by training as follows: respectively extracting the characteristics of a user gray level image sample and a user depth image sample which are input into a face reconstruction network model to obtain the gray level characteristics and the depth characteristics of the user; fusing the gray level features and the depth features of the user to obtain face model parameters of the user, wherein the face model parameters comprise identity parameters, expression parameters, texture parameters, rotation parameters and displacement parameters; obtaining face information according to the face model parameters of the user; and obtaining a loss value according to the face information, if the loss value does not reach a stop condition, adjusting parameters of the face reconstruction network model, and repeatedly executing the steps until the stop condition is reached to obtain the trained face reconstruction network model, wherein the weight of the user eyes in a first loss function corresponding to the loss value is not less than a preset threshold value.
As another optional implementation, the apparatus is further configured to: acquire a first point cloud sample of the user, and a point cloud sample and a texture sample of an occluder; superimpose the point cloud sample of the occluder on the first point cloud sample of the user to obtain a second point cloud sample of the user; blank the second point cloud sample of the user to obtain a third point cloud sample of the user; render the third point cloud sample and the texture sample of the occluder to obtain a two-dimensional image sample of the user; and perform noise-adding enhancement processing on the two-dimensional image sample of the user and on the third point cloud sample, respectively, to obtain an enhanced two-dimensional image sample and an enhanced depth image sample of the user, where the enhanced two-dimensional image sample and the enhanced depth image sample of the user are respectively the user grayscale image sample and the user depth image sample input to the face reconstruction network model.
In a third aspect, the present application provides a computer storage medium comprising computer instructions that, when executed on an electronic device, cause the electronic device to perform the method as provided in any one of the possible embodiments of the first aspect.
In a fourth aspect, the embodiments of the present application provide a computer program product, which when run on a computer, causes the computer to execute the method as provided in any one of the possible embodiments of the first aspect.
In a fifth aspect, embodiments of the present application provide an eye tracking device, including a processor and a memory; wherein the memory is configured to store program code, and the processor is configured to call the program code to perform the method as provided in any one of the possible embodiments of the first aspect.
In a sixth aspect, an embodiment of the present application provides a server, which includes a processor, a memory, and a bus, where: the processor and the memory are connected through the bus; the memory is used for storing a computer program; the processor is configured to control the memory and execute the program stored in the memory to implement the method according to any one of the possible embodiments of the first aspect.
It is to be understood that the apparatus of the second aspect, the computer storage medium of the third aspect, the computer program product of the fourth aspect, the apparatus of the fifth aspect, and the server of the sixth aspect provided above are all configured to perform the method provided in any implementation of the first aspect. Therefore, for the beneficial effects they achieve, reference may be made to the beneficial effects of the corresponding method, which are not repeated here.
Drawings
Fig. 1 is a schematic flowchart of an eyeball tracking method according to an embodiment of the present application;
fig. 2 is a schematic diagram of an image preprocessing method according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of a face model reconstruction method according to an embodiment of the present application;
fig. 4 is a schematic diagram of a training method for reconstructing a face model according to an embodiment of the present application;
fig. 5 is a schematic flowchart of another eye tracking method according to an embodiment of the present disclosure;
fig. 6a is a schematic diagram of an image before being processed according to an embodiment of the present application;
fig. 6b is a schematic diagram of an image after being processed according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an eyeball tracking device provided by the embodiment of the application;
fig. 8 is a schematic structural diagram of another eye tracking device according to an embodiment of the present disclosure.
Detailed Description
The embodiments of the present application are applicable to technologies or scenes such as gaze estimation and gaze tracking in vehicle-mounted scenes, game interaction, and the like.
Fig. 1 is a schematic flow chart of an eyeball tracking method according to an embodiment of the present disclosure. The eyeball tracking method provided in the embodiment of the application can be executed by a vehicle-mounted apparatus (such as an in-vehicle head unit), or by a terminal device such as a mobile phone or a computer; the present solution does not particularly limit this. As shown in fig. 1, the method may include steps 101-104, as follows:
101. preprocessing a gray image and a depth image to obtain a gray-depth image of a target under a preset coordinate system, wherein the gray image and the depth image both contain head information of the target;
the target may be a user, a robot, or the like, and this is not particularly limited in the embodiment of the present application.
As an optional implementation, as shown in fig. 2, the grayscale image and the depth image are preprocessed as follows: a high-resolution grayscale image of the target is acquired by an infrared (IR) sensor, and a low-resolution depth image of the target is acquired by a depth camera; the low-resolution depth image and the high-resolution grayscale image are then aligned, interpolated and fused to obtain a high-resolution point cloud in the coordinate system of the infrared sensor.
Specifically, the infrared sensor and the depth sensor are calibrated to obtain the transformation between their coordinate systems, the depth measured by the depth sensor is then converted into the infrared sensor coordinate system, and finally the aligned infrared-depth (IR-Depth) data, namely the gray-depth image of the target, is output.
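As an illustration only (not the patented implementation), this alignment step can be sketched in Python as follows; the intrinsic matrices and the rotation/translation between the two sensors are assumed to come from the offline calibration mentioned above:

```python
import numpy as np

def align_depth_to_ir(depth, K_depth, K_ir, R, t, ir_shape):
    """Reproject a low-resolution depth map into the IR sensor's image plane.

    depth    : (H, W) depth map from the depth/TOF sensor
    K_depth  : 3x3 intrinsics of the depth sensor
    K_ir     : 3x3 intrinsics of the IR sensor
    R, t     : rotation (3x3) and translation (3,) from depth to IR coordinates
    ir_shape : (H_ir, W_ir) resolution of the IR image
    Returns an IR-aligned depth map (zeros where no depth point projects).
    """
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    valid = z > 0
    pix = np.stack([us.reshape(-1), vs.reshape(-1), np.ones(h * w)], axis=0)

    # Back-project depth pixels to 3D points in the depth-sensor frame.
    pts_depth = np.linalg.inv(K_depth) @ pix * z
    # Transform the points into the IR-sensor coordinate system.
    pts_ir = R @ pts_depth[:, valid] + t.reshape(3, 1)
    # Project into the IR image plane.
    proj = K_ir @ pts_ir
    u_ir = np.round(proj[0] / proj[2]).astype(int)
    v_ir = np.round(proj[1] / proj[2]).astype(int)

    aligned = np.zeros(ir_shape, dtype=np.float32)
    inside = (u_ir >= 0) & (u_ir < ir_shape[1]) & (v_ir >= 0) & (v_ir < ir_shape[0])
    aligned[v_ir[inside], u_ir[inside]] = pts_ir[2, inside]
    return aligned
```

The interpolation and fusion needed to fill the remaining holes and reach the full IR resolution are omitted from this sketch.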
102. Performing human head detection on the gray-depth image of the target to obtain a gray-depth image of the head of the target;
As an alternative implementation, head detection is performed on the gray-depth image of the target using a detection algorithm, for example a common deep-learning-based head detection algorithm.
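As a minimal sketch, assuming some head detector that returns a bounding box (the scheme does not prescribe a specific detector, so `detect_head` below is a hypothetical callable), cropping the aligned gray-depth data to the head region could look like this:

```python
import numpy as np

def crop_head(gray, aligned_depth, detect_head):
    """Crop the gray-depth data to the detected head region.

    gray          : (H, W) IR grayscale image
    aligned_depth : (H, W) depth map aligned to the IR image (same shape as gray)
    detect_head   : callable returning a head box (x, y, w, h) on the gray image
                    (hypothetical deep-learning detector)
    """
    x, y, w, h = detect_head(gray)
    head_gray = gray[y:y + h, x:x + w]
    head_depth = aligned_depth[y:y + h, x:x + w]
    # Stack into a single two-channel "gray-depth image of the head".
    return np.stack([head_gray, head_depth], axis=-1)
```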
103. Carrying out face reconstruction processing on the gray level-depth image of the head of the target to obtain face information of the target;
as an alternative implementation manner, as shown in fig. 3, a schematic diagram of a face model reconstruction method provided in the embodiment of the present application is shown. Performing feature extraction on the gray level-depth image of the head of the target to obtain a gray level feature and a depth feature of the target; and carrying out fusion processing on the gray level feature and the depth feature of the target to obtain the human face model parameter of the target.
Optionally, the face model parameters include an identity parameter, an expression parameter, a texture parameter, a rotation parameter, a displacement parameter, and a spherical harmonic parameter. The identity parameter refers to the identity information of the user; the expression parameter refers to the expression information of the user; the texture parameter refers to the principal component coefficients of the user's albedo; the rotation parameter refers to the rotation vector that converts the user's head from the world coordinate system to the camera coordinate system; the displacement parameter refers to the corresponding translation vector; and the spherical harmonic parameters are parameters of the illumination model, used for modeling the illumination.
The face information of the target is then obtained based on these face model parameters of the target.
As another optional implementation manner, the gray-depth image of the head of the target is input to a face reconstruction network model for processing, so as to obtain the face information of the target. The human face reconstruction network model obtains the gray characteristic and the depth characteristic of the target by extracting the characteristic of the gray-depth image of the head of the target; performing fusion processing on the gray level feature and the depth feature of the target to obtain a human face model parameter of the target; and further obtaining the face information of the target according to the face model parameters of the target. That is to say, the face model parameters are regressed through the face reconstruction network model, and face mesh information, namely face information, under a preset coordinate system is further acquired.
Specifically, the gray-depth image of the head of the target is input into a first feature extraction layer of a face reconstruction network model for gray feature extraction, the gray-depth image of the head of the target is input into a second feature extraction layer for depth feature extraction, then the features extracted by the first feature extraction layer and the second feature extraction layer are input into a feature fusion layer for fusion processing, and finally, face model parameters obtained by face reconstruction network model regression are output.
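A minimal PyTorch-style sketch of such a two-branch network is given below; the layer sizes, parameter dimensions and the exact split of the regressed vector are illustrative assumptions rather than the patented architecture:

```python
import torch
import torch.nn as nn

class FaceReconNet(nn.Module):
    """Two feature-extraction branches (gray / depth) followed by a fusion layer
    that regresses the face model parameters (identity, expression, texture,
    rotation, translation, spherical harmonics)."""

    def __init__(self, n_id=80, n_exp=64, n_tex=80, n_sh=27):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.gray_branch = branch()    # first feature extraction layer
        self.depth_branch = branch()   # second feature extraction layer
        self.fusion = nn.Sequential(   # feature fusion layer + regression head
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, n_id + n_exp + n_tex + 3 + 3 + n_sh))
        self.sizes = [n_id, n_exp, n_tex, 3, 3, n_sh]

    def forward(self, gray, depth):
        # gray, depth: (B, 1, H, W) crops of the head region
        f = torch.cat([self.gray_branch(gray), self.depth_branch(depth)], dim=1)
        params = self.fusion(f)
        # Split the regressed vector into the individual face model parameters.
        return torch.split(params, self.sizes, dim=1)
```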
The face reconstruction network model can be obtained by adopting convolutional neural network training. Specifically, as shown in fig. 4, feature extraction is performed on a gray level image sample of a user and a depth image sample of the user, which are input into a face reconstruction network model, respectively, so as to obtain a gray level feature and a depth feature of the user; then, carrying out fusion processing on the gray level feature and the depth feature of the user to obtain the face model parameters of the user, wherein the face model parameters comprise identity parameters, expression parameters, texture parameters, rotation parameters, displacement parameters and spherical harmonic parameters; obtaining face information according to the face model parameters of the user; obtaining a loss value according to the face information, the user gray level image sample and the user depth image sample, if the loss value does not reach a stop condition, adjusting parameters of the face reconstruction network model, and repeatedly executing the steps until the stop condition is reached to obtain the trained face reconstruction network model, wherein the weight of the user eyes in a first loss function corresponding to the loss value is not less than a preset threshold value. The first loss function may be a geometric loss function.
As an alternative implementation, the convolutional neural network is trained in a self-supervised manner, using the following three loss functions:
1) Geometric loss E_geo(X), used to calculate the error between the face point cloud and the depth image point cloud:
E_geo(X) = w_pp * E_pp(X) + w_ps * E_ps(X);
where E_pp(X) is the point-to-point loss; E_ps(X) is the loss from points to the surface of the face model; w_pp is the point-to-point weight; and w_ps is the point-to-surface weight.
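As an illustrative sketch (assuming the face-model vertices have already been matched to corresponding depth points and that surface normals of the face model are available; the correspondence search itself is not shown), the two terms could be computed as:

```python
import torch

def geometric_loss(face_pts, scan_pts, normals, w_pp=1.0, w_ps=1.0):
    """E_geo = w_pp * E_pp + w_ps * E_ps (sketch under the stated assumptions).

    face_pts : (N, 3) vertices of the reconstructed face model
    scan_pts : (N, 3) matched points from the depth image point cloud
    normals  : (N, 3) unit surface normals of the face model at face_pts
    """
    diff = face_pts - scan_pts
    e_pp = (diff ** 2).sum(dim=1).mean()                  # point-to-point term
    e_ps = ((diff * normals).sum(dim=1) ** 2).mean()      # point-to-surface term
    return w_pp * e_pp + w_ps * e_ps
```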
2) Face key point loss E_lan(X), used to calculate the projection error of the three-dimensional key points of the face model:
E_lan(X) = Σ_{i∈L} ||q_i − Π(R·p_i + t)||^2 + Σ_{(i,j)∈LP} ||(q_i − q_j) − (Π(R·p_i + t) − Π(R·p_j + t))||^2;
where L is the set of visible face key points; LP is the set of visible eye key points; q_i is the i-th face key point in the image; p_i is the i-th three-dimensional (3D) key point on the face model; R is the rotation matrix; t is the displacement vector; Π(·) denotes the camera projection; ||·||^2 denotes the squared Euclidean norm; and i, j are positive integers.
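A sketch of this landmark loss, following the formula above (the projection, the visible-landmark indices and the eye key-point pairs are passed in as assumptions):

```python
import torch

def landmark_loss(q, p, R, t, K, vis_idx, eye_pairs):
    """E_lan: reprojection error of 3D face-model key points plus a relative
    term for visible eye key-point pairs.

    q         : (M, 2) detected 2D key points in the image (tensor)
    p         : (M, 3) corresponding 3D key points on the face model (tensor)
    R, t, K   : rotation (3,3), translation (3,), camera intrinsics (3,3)
    vis_idx   : indices of visible face key points (the set L)
    eye_pairs : list of (i, j) index pairs of visible eye key points (the set LP)
    """
    cam = (R @ p.T).T + t               # transform model points to camera frame
    proj = (K @ cam.T).T
    proj = proj[:, :2] / proj[:, 2:3]   # perspective projection Pi(R p + t)

    e_abs = ((q[vis_idx] - proj[vis_idx]) ** 2).sum()
    e_rel = sum((((q[i] - q[j]) - (proj[i] - proj[j])) ** 2).sum()
                for i, j in eye_pairs)
    return e_abs + e_rel
```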
3) Pixel loss E_col(X), used to calculate the gray difference between the rendered gray values of the face model and the IR gray image:
E_col(X) = (1/|F|) * Σ_{p∈F} ||I_syn(p) − I_real(p)||^2;
where F is the set of pixels in which the face model is visible; I_syn(p) is the synthetically rendered pixel value; and I_real(p) is the pixel value in the actual image.
The convolutional neural network adopts the following face model regularization loss E_reg(X) as a face constraint:
E_reg(X) = σ_id * ||α_id||^2 + σ_alb * ||α_alb||^2 + σ_exp * ||α_exp||^2;
where α_id are the face identity coefficients; α_alb are the face albedo coefficients; α_exp are the facial expression coefficients; σ_id is the identity coefficient weight; σ_alb is the albedo coefficient weight; and σ_exp is the expression coefficient weight.
Because the human eyes are the key region in the eyeball tracking process, the present scheme can appropriately increase the weight of the human eyes in the geometric loss E_geo(X) used to calculate the error between the face point cloud and the depth image point cloud:
E_geo(X) = w_1 * E_eye(X) + w_2 * E_nose(X) + w_3 * E_mouth(X) + w_4 * E_other(X);
where E_eye(X) is the vertex loss of the eye region in the face model; E_nose(X) is the vertex loss of the nose region; E_mouth(X) is the vertex loss of the mouth region; E_other(X) is the vertex loss of the other regions; and w_1, w_2, w_3 and w_4 are the coefficients of the eye, nose, mouth and other regions, respectively.
The coefficient w_1 of the eye region in the face model satisfies the condition of being not less than a preset threshold, which may be any value. For example, w_1 satisfies: w_1 ≥ w_2, w_1 ≥ w_3 and w_1 ≥ w_4.
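A sketch of this region-weighted geometric loss, assuming every face-model vertex carries a region label and the eye weight is chosen no smaller than the other weights, as required above (the weight values themselves are illustrative):

```python
import torch

def region_weighted_geo_loss(face_pts, scan_pts, region, w=(2.0, 1.0, 1.0, 1.0)):
    """E_geo = w1*E_eye + w2*E_nose + w3*E_mouth + w4*E_other, with w1 >= w2, w3, w4.

    face_pts : (N, 3) reconstructed face-model vertices
    scan_pts : (N, 3) matched depth-image points
    region   : (N,) integer labels, 0 = eye, 1 = nose, 2 = mouth, 3 = other
    w        : per-region weights (illustrative values; w[0] is the eye weight)
    """
    per_vertex = ((face_pts - scan_pts) ** 2).sum(dim=1)   # squared vertex error
    loss = face_pts.new_zeros(())
    for r, w_r in enumerate(w):
        mask = region == r
        if mask.any():
            loss = loss + w_r * per_vertex[mask].mean()
    return loss
```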
By enhancing the loss weight of the eye region in this way, this embodiment achieves a higher reconstruction precision for the eye region.
A geometric loss value, a face key point loss value and a pixel loss value are calculated based on the above three loss functions. If the geometric loss value is not greater than a preset geometric loss threshold, the face key point loss value is not greater than a preset key point loss threshold, and the pixel loss value is not greater than a preset pixel loss threshold, training is stopped and the trained face reconstruction network model is obtained. If the loss values do not meet these conditions, the network parameters are adjusted and the training process is repeated until the stop condition is reached.
The stop condition in the above embodiment is explained by taking an example in which the loss value is not greater than the preset loss threshold value. The stopping condition may also be that the number of iterations reaches a preset number, and the like, and this is not specifically limited by the present scheme.
The above description is given by taking three loss functions as examples. Other loss functions may also be used, and this is not specifically limited in this embodiment.
104. And obtaining the pupil position of the target according to the face information.
As an optional implementation, the coordinates of the pupils of the two eyes can be further obtained from the eye-region key points of the three-dimensional face. Specifically, the pupil position of the target is obtained by solving from the position information of preset key points on the face, such as the eyelids and the eye corners. The pupil position is the starting point of the line of sight.
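The scheme does not restrict how this solving is done; as one simple illustration, the pupil position of each eye can be taken as the centroid of that eye's preset eyelid and eye-corner key points on the reconstructed mesh:

```python
import numpy as np

def pupil_positions(face_vertices, left_eye_idx, right_eye_idx):
    """Estimate the 3D pupil positions (sight-line starting points) from the
    eye-region key points of the reconstructed face. Illustrative only: the
    centroid of each eye's eyelid/eye-corner key points is used as the pupil.

    face_vertices                : (N, 3) reconstructed face mesh vertices
    left_eye_idx, right_eye_idx  : index lists of the preset eye-region key points
    """
    left_pupil = face_vertices[left_eye_idx].mean(axis=0)
    right_pupil = face_vertices[right_eye_idx].mean(axis=0)
    return left_pupil, right_pupil
```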
The embodiments of the present application are described only by taking eye tracking as an example. By adopting the above method, the position of the mouth, the position of the nose, the position of the ear, and the like of the target can be obtained, and the scheme is not particularly limited.
In the embodiments of the present application, the gray-depth image of the target is obtained based on the grayscale image and the depth image of the target, head detection yields the gray-depth image of the head of the target, and face reconstruction processing is performed on the gray-depth image of the head to obtain the pupil position of the target. With this method, the target's face is reconstructed from the two modalities of grayscale and depth information, and an accurate sight line starting point can be obtained in real time.
The sight line starting point depends on the accuracy of the eye region, and the eyeball tracking result is affected when the eyes of the target are occluded by hands, glasses, a hat and the like, or when the image changes because of lighting changes, depth errors in the depth image, and so on. In order to simulate the situations that can occur in various real scenes and enable the face reconstruction network model to cope with various complex scenarios, the present scheme further provides an eyeball tracking method that performs eyeball tracking based on enhanced two-dimensional images and three-dimensional point cloud images of the key regions of the target, thereby improving the robustness of the algorithm.
Fig. 5 is a schematic flowchart of another eyeball tracking method according to an embodiment of the present disclosure. The eyeball tracking method provided in the embodiment of the application can be executed by a vehicle-mounted apparatus (such as an in-vehicle head unit), or by a terminal device such as a mobile phone or a computer; the present solution does not particularly limit this. As shown in fig. 5, the method may include steps 501-504, as follows:
501. preprocessing a gray image and a depth image to obtain a gray-depth image of a target under a preset coordinate system, wherein the gray image and the depth image both contain head information of the target;
the target may be a user, a robot, or the like, and this is not particularly limited in the embodiment of the present application.
As an optional implementation, as shown in fig. 2, the grayscale image and the depth image are preprocessed as follows: a high-resolution grayscale image of the target is acquired by an infrared (IR) sensor, and a low-resolution depth image of the target is acquired by a depth camera; the low-resolution depth image and the high-resolution grayscale image are then aligned, interpolated and fused to obtain a high-resolution point cloud in the coordinate system of the infrared sensor.
Specifically, the infrared sensor and the depth sensor are calibrated to obtain the transformation between their coordinate systems, the depth measured by the depth sensor is then converted into the infrared sensor coordinate system, and finally the aligned IR-Depth data, namely the gray-depth image of the target, is output.
502. Performing human head detection on the gray-depth image of the target to obtain a gray-depth image of the head of the target;
As an alternative implementation, head detection is performed on the gray-depth image of the target using a detection algorithm, for example a common deep-learning-based head detection algorithm.
503. Carrying out face reconstruction processing on the gray level-depth image of the head of the target to obtain face information of the target;
the face reconstruction network model can be obtained by training based on steps 5031 and 5039, and the details are as follows:
5031. acquiring a first point cloud sample of a user, and a point cloud sample and a texture sample of an occluder;
the first point cloud sample may be an original point cloud sample of the user, i.e., the point cloud sample of the user without the obstruction.
The shelter is a shelter for the eyes, such as hands, glasses, a hat and the like, or other influences of light change and the like.
5032. Overlaying the point cloud sample of the occluder on the first point cloud sample of the user to obtain a second point cloud sample of the user;
and superposing the point cloud sample of the shielding object in front of the visual angle of the first point cloud sample camera of the user (namely on a camera coordinate system) to obtain a second point cloud sample of the user.
5033. Blanking the second point cloud sample of the user to obtain a third point cloud sample of the user;
in the process of drawing the realistic graphics, the depth information is lost due to projection transformation, which often results in the ambiguity of the graphics. To remove such ambiguities, it is necessary to remove the hidden invisible lines or surfaces during rendering, which is conventionally referred to as removing hidden lines and hidden surfaces, or simply blanking.
Blanking is applied to the invisible points behind the occluder, for example by removing the point cloud behind the occluder with a blanking algorithm (e.g., the Z-buffer algorithm), to obtain the blanked third point cloud sample of the user.
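A minimal z-buffer style blanking sketch: all points are projected into the camera, the occluder fills a depth buffer, and user points that fall behind it are discarded (the pixel-grid resolution and nearest-pixel projection are simplifications):

```python
import numpy as np

def zbuffer_blank(user_pts, occluder_pts, K, image_shape):
    """Remove user points hidden behind the occluder from the camera's view.

    user_pts, occluder_pts : (N, 3) / (M, 3) points in camera coordinates
    K                      : 3x3 camera intrinsics
    image_shape            : (H, W) of the projection grid used for the z-buffer
    Returns the blanked user point cloud (the 'third point cloud sample').
    """
    h, w = image_shape
    zbuf = np.full((h, w), np.inf)

    def project(pts):
        proj = (K @ pts.T).T
        u = np.clip(np.round(proj[:, 0] / proj[:, 2]).astype(int), 0, w - 1)
        v = np.clip(np.round(proj[:, 1] / proj[:, 2]).astype(int), 0, h - 1)
        return u, v, pts[:, 2]

    # Fill the z-buffer with the occluder's nearest depth per pixel.
    u, v, z = project(occluder_pts)
    for ui, vi, zi in zip(u, v, z):
        zbuf[vi, ui] = min(zbuf[vi, ui], zi)

    # Keep only user points that are closer to the camera than the occluder.
    u, v, z = project(user_pts)
    visible = z < zbuf[v, u]
    return user_pts[visible]
```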
5034. Rendering the third point cloud sample of the user and the texture sample of the occluder to obtain a two-dimensional image sample of the user;
The texture sample of the occluder located in front of the user is rendered so that it covers the user's texture behind it, yielding the two-dimensional image sample of the user.
5035. Respectively performing enhancement processing of adding noise on the two-dimensional image sample of the user and the third point cloud sample to obtain an enhanced two-dimensional image sample and an enhanced depth image sample of the user, wherein the enhanced two-dimensional image sample and the enhanced depth image sample of the user are respectively a user gray level image sample and a user depth image sample of the input face reconstruction network model;
two-dimensional images and three-dimensional point clouds are obtained after shielding enhancement is carried out, and blocks in various shapes can be superposed to serve as noise. The pixel values or point cloud coordinate values within such a block may conform to a predetermined distribution (e.g., the pixel value distribution satisfies a gaussian distribution with a mean value of 10 and a standard deviation of 0.1, and the point cloud coordinate is assigned a value of zero). To be more realistic, illumination noise, Time of flight (TOF) sensor noise data may also be simulated. For example, blocks of 25 × 25 pixel size, 50 × 50 pixel size, and 100 × 100 pixel size are randomly generated on an IR image and a TOF point cloud, where the gray values of the gray blocks in the two-dimensional image satisfy a gaussian distribution, the mean value of the distribution is the pixel mean value of the corresponding block in the original image, and the standard deviation is 0.01. The block in the point cloud picture can simulate noise such as holes, and the setting depth is zero at the moment. The effect is shown in fig. 6b, where fig. 6a is an effect diagram without superimposed noise.
As an alternative implementation, an original two-dimensional image and three-dimensional point cloud of the user in the cabin are acquired, and a scanner is used to acquire the three-dimensional scanned point cloud and texture information of the occluder. The occluder point cloud is superimposed on the user's three-dimensional point cloud, the points behind the occluder are removed with the z-buffer algorithm to obtain the processed point cloud of the user, and the processed point cloud is rendered with the scanned occluder texture to generate the processed two-dimensional image of the user.
Taking the hand occlusion as an example, in order to obtain data of the hand occlusion at various different positions, a scanner may be used to scan the hand first, and three-dimensional point cloud and texture information of the hand are obtained. In the original image, the position of a face key point in a two-dimensional image is obtained by using a face key point algorithm, and the position of the key point in a camera coordinate system can be found in a depth image or a three-dimensional point cloud image according to the position in the image. Then, the three-dimensional model of the hand obtained by scanning before can be put at the corresponding position through the coordinate information of the key point on the face. The occlusion is now in front, and from the sensor perspective, some face regions that were not occluded are now occluded by the hand, and the cloud of face points behind the hand can be eliminated using a blanking algorithm (e.g., z-buffer algorithm). Thus, a complete composite point cloud data can be obtained.
After the point cloud data is acquired, texture information can be acquired according to the point cloud data, and a two-dimensional image under the camera view angle can be rendered, so that an enhanced two-dimensional image and a three-dimensional depth image are acquired.
The above is only an example; data for reflective glasses, opaque sunglasses and other accessories that may cause occlusion can also be synthesized. The reconstruction data of the 3D object is obtained with the scanner, the rotation matrix R and the displacement vector T of the human eyes relative to the camera are roughly estimated by an algorithm, the 3D object is moved to the corresponding position using R and T and superimposed on the time-of-flight (TOF) point cloud data with a blanking algorithm, the mesh gray information is superimposed on the IR image through perspective projection, and the data synthesis is thus completed.
5036. Inputting the user gray level image sample and the user depth image sample into a face reconstruction network model to obtain the gray level feature and the depth feature of the user;
the user gray level image sample here is the enhanced two-dimensional image sample of the user, and the user depth image sample here is the enhanced depth image sample.
5037. Fusing the gray level features and the depth features of the user to obtain face model parameters of the user;
5038. obtaining face information according to the face model parameters of the user;
5039. obtaining a loss value according to the face information, a first gray image sample and a first depth image sample of the user, if the loss value does not reach a stop condition, adjusting parameters of the face reconstruction network model, and repeatedly executing the steps until the stop condition is reached to obtain the trained face reconstruction network model, wherein the weight of the user eyes in a first loss function corresponding to the loss value is not less than a preset threshold value;
the first grayscale image sample of the user is an original grayscale image sample of the user, that is, the grayscale image sample of the user when there is no obstruction. The first depth image sample of the user is an original depth image sample of the user, that is, a depth image sample of the user without an obstruction.
For the related description of steps 5036 to 5039, reference may be made to the foregoing embodiments, which are not repeated here.
504. And obtaining the pupil position of the target according to the face information.
In the embodiments of the present application, the point cloud sample of the user and the point cloud sample and texture sample of the occluder are acquired, and the presence of an occluder is simulated, so that a face reconstruction network model that can adapt to occluders is obtained through training. With this scheme, the data of the eye region is enhanced, so the reconstruction precision of the eye region is higher; and situations that can occur in various real scenes can be simulated and the corresponding enhanced two-dimensional and three-dimensional images obtained, which improves the robustness of the algorithm.
It should be noted that the eyeball tracking method provided by the application can be executed locally, or can be executed by uploading the grayscale image and the depth image of the target to the cloud. The cloud can be implemented by a server, which may be a virtual server, a physical server or the like, or by other devices; the present solution does not particularly limit this.
Referring to fig. 7, an eyeball tracking apparatus is provided for an embodiment of the present application. The apparatus may be a vehicle-mounted apparatus (e.g., an in-vehicle head unit), or a terminal device such as a mobile phone or a computer. The apparatus comprises a preprocessing module 701, a detection module 702, a reconstruction processing module 703 and an acquisition module 704, detailed as follows:
the preprocessing module 701 is configured to preprocess the grayscale image and the depth image to obtain a gray-depth image of the target in a preset coordinate system, where the grayscale image and the depth image both include head information of the target;
a detection module 702, configured to perform human head detection on the gray-level depth image of the target to obtain a gray-level depth image of the head of the target;
a reconstruction processing module 703, configured to perform face reconstruction processing on the gray-level depth image of the head of the target to obtain face information of the target;
an obtaining module 704, configured to obtain a pupil position of the target according to the face information.
In the embodiments of the present application, the gray-depth image of the target is obtained based on the grayscale image and the depth image of the target, head detection yields the gray-depth image of the head of the target, and face reconstruction processing is performed on the gray-depth image of the head to obtain the pupil position of the target. With this method, the target's face is reconstructed from the two modalities of grayscale and depth information, and an accurate sight line starting point can be obtained in real time.
As an optional implementation manner, the reconstruction processing module 703 is configured to:
performing feature extraction on the gray level-depth image of the head of the target to obtain a gray level feature and a depth feature of the target;
fusing the gray level feature and the depth feature of the target to obtain a human face model parameter of the target;
and obtaining the face information of the target according to the face model parameters of the target.
The face model parameters of the target are obtained by fusing the grayscale feature and the depth feature of the target, and the face information of the target is then obtained from these parameters. Because the face model parameters fuse both grayscale and depth features, they are more comprehensive than prior-art parameters derived from grayscale features alone, which can effectively improve the eyeball tracking precision.
As an alternative implementation, the face reconstruction processing on the gray-scale depth image of the head of the target is processed through a face reconstruction network model.
As an optional implementation manner, the face reconstruction network model is obtained by training as follows:
respectively extracting the characteristics of a user gray level image sample and a user depth image sample which are input into a face reconstruction network model to obtain the gray level characteristics and the depth characteristics of the user;
fusing the gray level features and the depth features of the user to obtain face model parameters of the user, wherein the face model parameters comprise identity parameters, expression parameters, texture parameters, rotation parameters and displacement parameters;
obtaining face information according to the face model parameters of the user;
and obtaining a loss value according to the face information, if the loss value does not reach a stop condition, adjusting parameters of the face reconstruction network model, and repeatedly executing the steps until the stop condition is reached to obtain the trained face reconstruction network model, wherein the weight of the user eyes in a first loss function corresponding to the loss value is not less than a preset threshold value.
As another optional implementation, the apparatus is further configured to: acquire a first point cloud sample of the user, and a point cloud sample and a texture sample of an occluder; superimpose the point cloud sample of the occluder on the first point cloud sample of the user to obtain a second point cloud sample of the user; blank the second point cloud sample of the user to obtain a third point cloud sample of the user; render the third point cloud sample and the texture sample of the occluder to obtain a two-dimensional image sample of the user; and perform noise-adding enhancement processing on the two-dimensional image sample of the user and on the third point cloud sample, respectively, to obtain an enhanced two-dimensional image sample and an enhanced depth image sample of the user, where the enhanced two-dimensional image sample and the enhanced depth image sample of the user are respectively the user grayscale image sample and the user depth image sample input to the face reconstruction network model.
It should be noted that the preprocessing module 701, the detecting module 702, the reconstructing processing module 703 and the obtaining module 704 are configured to execute relevant steps of the foregoing method. For example, the preprocessing module 701 is configured to execute the relevant content of step 101 and/or step 501, the detection module 702 is configured to execute the relevant content of step 102 and/or step 502, the reconstruction processing module 703 is configured to execute the relevant content of step 103 and/or step 503, and the acquisition module 704 is configured to execute the relevant content of step 104 and/or step 504.
In the embodiments of the present application, the point cloud sample of the user and the point cloud sample and texture sample of the occluder are acquired, and the presence of an occluder is simulated, so that a face reconstruction network model that can adapt to occluders is obtained through training. With this scheme, the data of the eye region is enhanced, so the reconstruction precision of the eye region is higher; and situations that can occur in various real scenes can be simulated and the corresponding enhanced two-dimensional image and enhanced three-dimensional point cloud image obtained, which improves the robustness of the algorithm.
In this embodiment, the eyeball tracking device is represented in a module form. A "module" herein may refer to an application-specific integrated circuit (ASIC), a processor and memory that execute one or more software or firmware programs, an integrated logic circuit, and/or other devices that may provide the described functionality. Further, the above preprocessing module 701, the detection module 702, the reconstruction processing module 703 and the acquisition module 704 may be implemented by the processor 801 of the eye tracking apparatus shown in fig. 8.
Fig. 8 is a schematic structural diagram of another eyeball tracking device according to an embodiment of the present application. As shown in fig. 8, the eye tracking apparatus 800 comprises at least one processor 801, at least one memory 802 and at least one communication interface 803. The processor 801, the memory 802 and the communication interface 803 are connected through the communication bus and perform communication with each other.
The processor 801 may be a general purpose Central Processing Unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of programs according to the above schemes.
The communication interface 803 is used for communicating with other devices or communication networks, such as Ethernet, a radio access network (RAN), a wireless local area network (WLAN), and the like.
The memory 802 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory may be self-contained and coupled to the processor via a bus, or it may be integrated with the processor.
The memory 802 is used for storing application program codes for executing the above schemes, and is controlled by the processor 801 to execute. The processor 801 is used to execute application program code stored in the memory 802.
The memory 802 stores code that may perform one of the eye tracking methods provided above.
It should be noted that although the eye tracking apparatus 800 shown in fig. 8 only shows a memory, a processor and a communication interface, in the specific implementation process, those skilled in the art will understand that the eye tracking apparatus 800 also includes other devices necessary for normal operation. Also, as may be appreciated by those skilled in the art, the eye tracking apparatus 800 may also include hardware components for performing other additional functions, according to particular needs. Furthermore, those skilled in the art will appreciate that the eye tracking apparatus 800 may also include only those components necessary to implement the embodiments of the present application, and need not include all of the components shown in FIG. 8.
The embodiment of the application also provides a chip system, and the chip system is applied to the electronic equipment; the chip system comprises one or more interface circuits, and one or more processors; the interface circuit and the processor are interconnected through a line; the interface circuit is to receive a signal from a memory of the electronic device and to send the signal to the processor, the signal comprising computer instructions stored in the memory; the electronic device performs the method when the processor executes the computer instructions.
Embodiments of the present application also provide a computer-readable storage medium having stored therein instructions, which when executed on a computer or processor, cause the computer or processor to perform one or more steps of any one of the methods described above.
The embodiment of the application also provides a computer program product containing instructions. The computer program product, when run on a computer or processor, causes the computer or processor to perform one or more steps of any of the methods described above.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
It should be understood that in the description of the present application, unless otherwise indicated, "/" indicates a relationship where the objects associated before and after are an "or", e.g., a/B may indicate a or B; wherein A and B can be singular or plural. Also, in the description of the present application, "a plurality" means two or more than two unless otherwise specified. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of the singular or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or multiple. In addition, in order to facilitate clear description of technical solutions of the embodiments of the present application, in the embodiments of the present application, terms such as "first" and "second" are used to distinguish the same items or similar items having substantially the same functions and actions. Those skilled in the art will appreciate that the terms "first," "second," etc. do not denote any order or quantity, nor do the terms "first," "second," etc. denote any order or importance. Also, in the embodiments of the present application, words such as "exemplary" or "for example" are used to mean serving as examples, illustrations or illustrations. Any embodiment or design described herein as "exemplary" or "e.g.," is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present relevant concepts in a concrete fashion for ease of understanding.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the described division into units is merely a logical functional division; in actual implementation there may be other divisions, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses or units, and may be electrical, mechanical or in other forms.
Units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When software is used, the implementation may take the form, in whole or in part, of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted by means of such a medium. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or a data center integrating one or more usable media. The usable medium may be a read-only memory (ROM), a random access memory (RAM), a magnetic medium (for example, a floppy disk, a hard disk, a magnetic tape, or a magnetic disk), an optical medium (for example, a digital versatile disc (DVD)), or a semiconductor medium (for example, a solid state disk (SSD)).
The above description is only a specific implementation of the embodiments of the present application, but the scope of the embodiments of the present application is not limited thereto; any change or substitution within the technical scope disclosed in the embodiments of the present application shall fall within the scope of the embodiments of the present application. Therefore, the protection scope of the embodiments of the present application shall be subject to the protection scope of the claims.

Claims (14)

1. An eye tracking method, comprising:
preprocessing a grayscale image and a depth image to obtain a grayscale-depth image of a target in a preset coordinate system, wherein both the grayscale image and the depth image contain head information of the target;
performing head detection on the grayscale-depth image of the target to obtain a grayscale-depth image of the head of the target;
performing face reconstruction processing on the grayscale-depth image of the head of the target to obtain face information of the target; and
obtaining a pupil position of the target according to the face information.
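For illustration only (not part of the claimed subject matter), the following Python sketch shows one way the steps of claim 1 could be organized. The back-projection into a common camera coordinate system is standard pinhole geometry; the helper callables detect_head() and reconstruct_face(), the intrinsics tuple and the "pupil_3d" key are hypothetical placeholders rather than details disclosed by the claim.

```python
# Minimal sketch of the claimed pipeline (assumptions noted in the lead-in).
import numpy as np

def back_project(depth, fx, fy, cx, cy):
    """Convert a depth map (units as delivered by the sensor) to an HxWx3 point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)

def track_pupil(gray, depth, intrinsics, detect_head, reconstruct_face):
    # 1. Preprocess: put grayscale and depth into one gray-depth representation
    #    under a common (camera) coordinate system.
    points = back_project(depth, *intrinsics)                         # H x W x 3
    gray_depth = np.concatenate([gray[..., None], points], axis=-1)   # H x W x 4

    # 2. Head detection: crop the gray-depth image to the head region.
    x0, y0, x1, y1 = detect_head(gray_depth)
    head = gray_depth[y0:y1, x0:x1]

    # 3. Face reconstruction: recover 3D face information from the crop.
    face_info = reconstruct_face(head)

    # 4. Pupil position: read the eye location off the reconstructed face.
    return face_info["pupil_3d"]
```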
2. The method according to claim 1, wherein the performing face reconstruction processing on the grayscale-depth image of the head of the target to obtain the face information of the target comprises:
performing feature extraction on the grayscale-depth image of the head of the target to obtain a grayscale feature and a depth feature of the target;
fusing the grayscale feature and the depth feature of the target to obtain face model parameters of the target; and
obtaining the face information of the target according to the face model parameters of the target.
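A minimal sketch of one possible feature-extraction-and-fusion architecture for claim 2, written in PyTorch. The layer types, sizes and the parameter count of 257 are assumptions for illustration; the claim only requires that separate grayscale and depth features be extracted and fused into face model parameters.

```python
# Illustrative two-branch fusion network (architecture is an assumption).
import torch
import torch.nn as nn

class FaceReconNet(nn.Module):
    def __init__(self, n_params=257):  # number of face model parameters (assumed)
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.gray_branch = branch()    # grayscale feature extractor
        self.depth_branch = branch()   # depth feature extractor
        self.fuse = nn.Sequential(     # fuse the two features into face model parameters
            nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, n_params))

    def forward(self, gray, depth):
        f_gray = self.gray_branch(gray)
        f_depth = self.depth_branch(depth)
        return self.fuse(torch.cat([f_gray, f_depth], dim=1))
```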
3. The method according to claim 2, wherein the face reconstruction processing on the grayscale-depth image of the head of the target is performed by a face reconstruction network model.
4. The method according to claim 3, wherein the face reconstruction network model is trained by:
separately performing feature extraction on a user grayscale image sample and a user depth image sample that are input into the face reconstruction network model, to obtain a grayscale feature and a depth feature of the user;
fusing the grayscale feature and the depth feature of the user to obtain face model parameters of the user, wherein the face model parameters comprise an identity parameter, an expression parameter, a texture parameter, a rotation parameter and a displacement parameter;
obtaining face information according to the face model parameters of the user; and
obtaining a loss value according to the face information, and, if the loss value does not satisfy a stop condition, adjusting parameters of the face reconstruction network model and repeating the foregoing steps until the stop condition is satisfied, to obtain the trained face reconstruction network model, wherein a weight of the user's eyes in a first loss function corresponding to the loss value is not less than a preset threshold.
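The training procedure of claim 4 could look like the sketch below, assuming a landmark-style loss as the "first loss function". The decode_landmarks callable, the eye weight of 5.0 and the loss-threshold stop condition are illustrative assumptions; the claim only fixes that the weight given to the eyes is not below a preset threshold.

```python
# Hedged sketch of the training loop in claim 4 (PyTorch).
import torch

def eye_weighted_loss(pred_lmk, gt_lmk, eye_idx, eye_weight=5.0):
    """Per-landmark L2 loss in which eye landmarks get a weight no smaller
    than the preset threshold (here represented by eye_weight)."""
    w = torch.ones(gt_lmk.shape[1], device=gt_lmk.device)
    w[eye_idx] = eye_weight
    return (w[None, :, None] * (pred_lmk - gt_lmk) ** 2).mean()

def train(model, loader, optimizer, decode_landmarks, eye_idx,
          loss_threshold=1e-3, max_epochs=100):
    for epoch in range(max_epochs):
        for gray, depth, gt_lmk in loader:
            params = model(gray, depth)            # face model parameters
            pred_lmk = decode_landmarks(params)    # hypothetical decoder: parameters -> face info
            loss = eye_weighted_loss(pred_lmk, gt_lmk, eye_idx)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        if loss.item() < loss_threshold:           # stop condition reached
            break
    return model
```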
5. The method according to claim 4, further comprising:
acquiring a first point cloud sample of the user, and a point cloud sample and a texture sample of an occluder;
overlaying the point cloud sample of the occluder on the first point cloud sample of the user to obtain a second point cloud sample of the user;
performing blanking (hidden-surface removal) on the second point cloud sample of the user to obtain a third point cloud sample of the user;
rendering the third point cloud sample and the texture sample of the occluder to obtain a two-dimensional image sample of the user; and
separately performing noise-adding enhancement processing on the two-dimensional image sample of the user and on the third point cloud sample to obtain an enhanced two-dimensional image sample and an enhanced depth image sample of the user, wherein the enhanced two-dimensional image sample and the enhanced depth image sample of the user are respectively the user grayscale image sample and the user depth image sample input into the face reconstruction network model.
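The sample-synthesis steps of claim 5 can be sketched as follows. Here remove_hidden_points() and render_to_image() are hypothetical placeholders for a blanking (hidden-surface removal) routine and a textured point-cloud renderer, and the Gaussian noise model is an assumption; the projection of the noisy point cloud back to a depth image is omitted.

```python
# Sketch of occlusion overlay + blanking + rendering + noise augmentation.
import numpy as np

def synthesize_training_sample(user_points, occluder_points, occluder_texture,
                               remove_hidden_points, render_to_image,
                               noise_sigma=0.01):
    # Overlay the occluder point cloud on the user's point cloud (second sample).
    overlaid = np.concatenate([user_points, occluder_points], axis=0)

    # Blanking: drop points hidden behind the occluder along the viewing
    # direction (third sample).
    visible = remove_hidden_points(overlaid)

    # Render the visible points together with the occluder texture into a
    # two-dimensional (grayscale) image sample.
    image = render_to_image(visible, occluder_texture)

    # Noise augmentation of both the image sample and the point cloud sample.
    gray_sample = image + np.random.normal(0.0, noise_sigma, image.shape)
    noisy_points = visible + np.random.normal(0.0, noise_sigma, visible.shape)
    return gray_sample, noisy_points
```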
6. An eye tracking apparatus, comprising:
a preprocessing module, configured to preprocess a grayscale image and a depth image to obtain a grayscale-depth image of a target in a preset coordinate system, wherein both the grayscale image and the depth image contain head information of the target;
a detection module, configured to perform head detection on the grayscale-depth image of the target to obtain a grayscale-depth image of the head of the target;
a reconstruction processing module, configured to perform face reconstruction processing on the grayscale-depth image of the head of the target to obtain face information of the target; and
an acquisition module, configured to obtain a pupil position of the target according to the face information.
7. The apparatus according to claim 6, wherein the reconstruction processing module is configured to:
perform feature extraction on the grayscale-depth image of the head of the target to obtain a grayscale feature and a depth feature of the target;
fuse the grayscale feature and the depth feature of the target to obtain face model parameters of the target; and
obtain the face information of the target according to the face model parameters of the target.
8. The apparatus according to claim 7, wherein the face reconstruction processing on the grayscale-depth image of the head of the target is performed by a face reconstruction network model.
9. The apparatus according to claim 8, wherein the face reconstruction network model is trained by:
separately performing feature extraction on a user grayscale image sample and a user depth image sample that are input into the face reconstruction network model, to obtain a grayscale feature and a depth feature of the user;
fusing the grayscale feature and the depth feature of the user to obtain face model parameters of the user, wherein the face model parameters comprise an identity parameter, an expression parameter, a texture parameter, a rotation parameter and a displacement parameter;
obtaining face information according to the face model parameters of the user; and
obtaining a loss value according to the face information, and, if the loss value does not satisfy a stop condition, adjusting parameters of the face reconstruction network model and repeating the foregoing steps until the stop condition is satisfied, to obtain the trained face reconstruction network model, wherein a weight of the user's eyes in a first loss function corresponding to the loss value is not less than a preset threshold.
10. The apparatus according to claim 9, wherein the apparatus is further configured to:
acquire a first point cloud sample of the user, and a point cloud sample and a texture sample of an occluder;
overlay the point cloud sample of the occluder on the first point cloud sample of the user to obtain a second point cloud sample of the user;
perform blanking (hidden-surface removal) on the second point cloud sample of the user to obtain a third point cloud sample of the user;
render the third point cloud sample and the texture sample of the occluder to obtain a two-dimensional image sample of the user; and
separately perform noise-adding enhancement processing on the two-dimensional image sample of the user and on the third point cloud sample to obtain an enhanced two-dimensional image sample and an enhanced depth image sample of the user, wherein the enhanced two-dimensional image sample and the enhanced depth image sample of the user are respectively the user grayscale image sample and the user depth image sample input into the face reconstruction network model.
11. An eye tracking device, comprising a processor and a memory, wherein the memory is configured to store program code, and the processor is configured to invoke the program code to perform the method according to any one of claims 1 to 5.
12. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program that, when executed by a processor, implements the method according to any one of claims 1 to 5.
13. A computer program product, wherein when the computer program product runs on a computer, the computer is caused to perform the method according to any one of claims 1 to 5.
14. A server, comprising a processor, a memory, and a bus, wherein:
the processor and the memory are connected through the bus;
the memory is configured to store a computer program; and
the processor is configured to control the memory and execute the program stored in the memory, to implement the method according to any one of claims 1 to 5.
CN202180001560.7A 2021-04-26 2021-04-26 Eyeball tracking method, device and storage medium Active CN113366491B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/090064 WO2022226747A1 (en) 2021-04-26 2021-04-26 Eyeball tracking method and apparatus and storage medium

Publications (2)

Publication Number Publication Date
CN113366491A (en) 2021-09-07
CN113366491B CN113366491B (en) 2022-07-22

Family

ID=77523064

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180001560.7A Active CN113366491B (en) 2021-04-26 2021-04-26 Eyeball tracking method, device and storage medium

Country Status (2)

Country Link
CN (1) CN113366491B (en)
WO (1) WO2022226747A1 (en)


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100682889B1 (en) * 2003-08-29 2007-02-15 삼성전자주식회사 Method and Apparatus for image-based photorealistic 3D face modeling
CN108549886A (en) * 2018-06-29 2018-09-18 汉王科技股份有限公司 A kind of human face in-vivo detection method and device

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103440476A (en) * 2013-08-26 2013-12-11 大连理工大学 Locating method for pupil in face video
CN103810472A (en) * 2013-11-29 2014-05-21 南京大学 Method for pupil position filtering based on movement correlation
CN103810491A (en) * 2014-02-19 2014-05-21 北京工业大学 Head posture estimation interest point detection method fusing depth and gray scale image characteristic points
CN104143086A (en) * 2014-07-18 2014-11-12 吴建忠 Application technology of portrait comparison to mobile terminal operating system
CN104778441A (en) * 2015-01-07 2015-07-15 深圳市唯特视科技有限公司 Multi-mode face identification device and method fusing grey information and depth information
CN109643366A (en) * 2016-07-21 2019-04-16 戈斯蒂冈有限责任公司 For monitoring the method and system of the situation of vehicle driver
CN106469465A (en) * 2016-08-31 2017-03-01 深圳市唯特视科技有限公司 A kind of three-dimensional facial reconstruction method based on gray scale and depth information
CN110363133A (en) * 2019-07-10 2019-10-22 广州市百果园信息技术有限公司 A kind of method, apparatus, equipment and the storage medium of line-of-sight detection and video processing
CN110619303A (en) * 2019-09-16 2019-12-27 Oppo广东移动通信有限公司 Method, device and terminal for tracking point of regard and computer readable storage medium
CN111222468A (en) * 2020-01-08 2020-06-02 浙江光珀智能科技有限公司 People stream detection method and system based on deep learning
CN112560584A (en) * 2020-11-27 2021-03-26 北京芯翌智能信息技术有限公司 Face detection method and device, storage medium and terminal

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113837153A (en) * 2021-11-25 2021-12-24 之江实验室 Real-time emotion recognition method and system integrating pupil data and facial expressions
CN114155557A (en) * 2021-12-07 2022-03-08 美的集团(上海)有限公司 Positioning method, positioning device, robot and computer-readable storage medium
CN114155557B (en) * 2021-12-07 2022-12-23 美的集团(上海)有限公司 Positioning method, positioning device, robot and computer-readable storage medium
CN114274514A (en) * 2021-12-22 2022-04-05 深圳市创必得科技有限公司 Model printing annular texture full blanking method, device, equipment and storage medium
CN114782864A (en) * 2022-04-08 2022-07-22 马上消费金融股份有限公司 Information processing method and device, computer equipment and storage medium
CN114782864B (en) * 2022-04-08 2023-07-21 马上消费金融股份有限公司 Information processing method, device, computer equipment and storage medium
CN115953813A (en) * 2022-12-19 2023-04-11 北京字跳网络技术有限公司 Expression driving method, device, equipment and storage medium
CN115953813B (en) * 2022-12-19 2024-01-30 北京字跳网络技术有限公司 Expression driving method, device, equipment and storage medium
CN116822260A (en) * 2023-08-31 2023-09-29 天河超级计算淮海分中心 Eyeball simulation method based on numerical conversion, electronic equipment and storage medium
CN116822260B (en) * 2023-08-31 2023-11-17 天河超级计算淮海分中心 Eyeball simulation method based on numerical conversion, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113366491B (en) 2022-07-22
WO2022226747A1 (en) 2022-11-03

Similar Documents

Publication Publication Date Title
CN113366491B (en) Eyeball tracking method, device and storage medium
CN110874864B (en) Method, device, electronic equipment and system for obtaining three-dimensional model of object
CN110889890B (en) Image processing method and device, processor, electronic equipment and storage medium
CN109003325B (en) Three-dimensional reconstruction method, medium, device and computing equipment
US10977818B2 (en) Machine learning based model localization system
CN107111753B (en) Gaze detection offset for gaze tracking models
CN104380338B (en) Information processor and information processing method
CN107004275B (en) Method and system for determining spatial coordinates of a 3D reconstruction of at least a part of a physical object
KR101608253B1 (en) Image-based multi-view 3d face generation
CN109660783B (en) Virtual reality parallax correction
EP4383193A1 (en) Line-of-sight direction tracking method and apparatus
Shen et al. Virtual mirror rendering with stationary rgb-d cameras and stored 3-d background
WO2019140945A1 (en) Mixed reality method applied to flight simulator
US11170521B1 (en) Position estimation based on eye gaze
CN113610889B (en) Human body three-dimensional model acquisition method and device, intelligent terminal and storage medium
CN110913751A (en) Wearable eye tracking system with slip detection and correction functions
EP4307233A1 (en) Data processing method and apparatus, and electronic device and computer-readable storage medium
JP2016522485A (en) Hidden reality effect and intermediary reality effect from reconstruction
JP2014106543A (en) Image processor, image processing method and program
CN113012293A (en) Stone carving model construction method, device, equipment and storage medium
CN110648274B (en) Method and device for generating fisheye image
JP7459051B2 (en) Method and apparatus for angle detection
US20210082176A1 (en) Passthrough visualization
CN115496864B (en) Model construction method, model reconstruction device, electronic equipment and storage medium
CN108734772A (en) High accuracy depth image acquisition methods based on Kinect fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant