CN114494347A - Single-camera multi-mode gaze tracking method and device, and electronic device


Info

Publication number: CN114494347A
Application number: CN202210073470.5A
Authority: CN (China)
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 姚超, 班晓娟, 杨金汩, 王笑琨, 孙金胜
Original assignee: University of Science and Technology Beijing (USTB)

Classifications

    • G06T 7/246: Image analysis; analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06F 18/241: Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N 3/045: Neural networks; combinations of networks
    • G06N 3/08: Neural networks; learning methods
    • G06T 2207/10016: Image acquisition modality; video; image sequence
    • G06T 2207/20084: Special algorithmic details; artificial neural networks [ANN]
    • G06T 2207/20088: Special algorithmic details; trinocular vision calculations; trifocal tensor
    • G06T 2207/30196: Subject of image; human being; person
    • G06T 2207/30201: Subject of image; face


Abstract

The invention discloses a single-camera multi-mode gaze tracking method and device, and an electronic device, belonging to the technical field of human-computer interaction. The method comprises the following steps: aligning the face information acquired by a camera using an attention-enhanced adaptive three-dimensional face alignment method; calibrating the face reference features and solving a face reference vector for the gaze tracking process; calculating the normal vectors of the triangular faces of several key points on the face mesh from the three-dimensional face alignment result; representing a reference projection vector starting from the three-dimensional face center, and fitting a gaze transfer matrix to obtain a dynamic reference vector of the freely moving head in three-dimensional space; establishing specific information for the user to be tracked; during continuous gaze interaction, obtaining iris data of the human eyes from a denoised, edge-enhanced image using the Hough transform; and determining an eye-movement reference vector from the iris data and the face base vector to complete gaze tracking. The method improves the timeliness of gaze tracking and depends only weakly on hardware equipment.

Description

Single-camera multi-mode gaze tracking method and device, and electronic device
Technical Field
The invention relates to the technical field of human-computer interaction, and in particular to a single-camera multi-mode gaze tracking method and device, and an electronic device.
Background
Gaze tracking has long been a focus of attention in the field of interaction, and algorithm research and interactive applications for gaze tracking are fairly mature. The face alignment features of three-dimensional space visually represent gaze intention and are the basic information source of most gaze tracking methods, so research on three-dimensional face alignment algorithms strongly influences gaze tracking technology. Gaze tracking methods fall into two major categories: methods based on dedicated hardware and methods based on computer vision; the latter can be further divided into end-to-end gaze tracking algorithms trained on specific data and gaze tracking algorithms based on fixed-point calibration schemes. With the continuous development of gaze tracking, hardware-based schemes are approaching maturity, but because the cost of the hardware and its environmental requirements make broad popularization difficult, most hardware devices are deployed in industrial scenarios, and computer-vision-based schemes have attracted increasing attention. Many end-to-end algorithms construct data sets and gaze tracking methods for proprietary environments, while fixed-point calibration schemes increasingly rely on multiple cameras for three-dimensional positioning, which increases hardware dependence.
For face alignment, many researchers have designed face alignment methods matching facial structural characteristics using the curvature features and standard structural features of the three-dimensional face. Some researchers segment convex regions within an image based on mean and Gaussian curvature features and create an Extended Gaussian Image (EGI) for each convex region. The EGI describes the shape of an object by the distribution of surface normals over the object surface; by correlating EGIs, a region of the target image is matched with a region of the gallery image. However, the EGI is insensitive to changes in object size, so two faces of similar shape but different size cannot be distinguished in such a representation. For gaze tracking, some researchers have proposed a method based on a generalized regression neural network that maps screen coordinates using pupil parameters, pupil-glint displacement, the directions and proportions of the major and minor axes of the pupil ellipse, and glint coordinates; this method requires no calibration after initial training but tolerates head movement only moderately.
There is therefore a strong need in the art for a gaze tracking method with good real-time performance and little dependence on equipment.
Disclosure of Invention
Embodiments of the invention aim to provide a single-camera multi-mode gaze tracking method and device, and an electronic device, which improve real-time performance and depend little on equipment when tracking a user's gaze.
To solve the above technical problems, the invention provides the following technical solution:
a single-camera multi-mode gaze tracking method, wherein the method comprises:
aligning face information acquired by a camera by using an attention-enhanced self-adaptive three-dimensional face alignment method to obtain a three-dimensional face alignment result;
calibrating the human face reference characteristics, and solving a human face reference vector in the sight tracking process;
calculating normal vectors of triangular surfaces of a plurality of key points on the face mesh by adopting the three-dimensional face alignment result;
representing a reference projection vector by taking the center of the three-dimensional face as a starting point, and fitting a sight line transfer matrix to obtain a dynamic reference vector of the flexible head in a three-dimensional space;
calibrating the characteristics of the user according to the interaction habits of different users, and establishing the specific information of the user to be tracked;
in the continuous sight line interaction process, acquiring iris data of human eyes in a de-noised image with a prominent edge by utilizing Hough transform; and determining an eye movement reference vector according to the human eye iris data and the human face base vector to finish sight tracking.
The method comprises the following steps of aligning face information collected by a camera through an attention-enhanced self-adaptive three-dimensional face alignment method to obtain a three-dimensional face alignment result, wherein the steps comprise:
determining a self-adaptive face detection method, and collecting face information by adopting the determined self-adaptive face detection method;
aiming at face information collected by a camera, extracting a multi-scale image receptive field through a multi-scale convolution kernel, and extracting face features in the face information by utilizing a deformation convolution to determine the deformed face features;
and introducing an attention mechanism in the process of identifying the three-dimensional face key points of the face information acquired by the camera to obtain a three-dimensional face alignment result.
The step of determining the adaptive face detection method comprises:
removing irrelevant feature points from the face information acquired by the camera using two pooling operations and cascaded rectified linear units;
acquiring feature information of different structures using Inception feature modules as the backbone network;
performing shape adaptation: after the backbone network, extracting features of different scales through several different convolution kernels using a single-shot multi-box prediction method.
Introducing an attention mechanism into the three-dimensional face key point recognition of the face information acquired by the camera to obtain the three-dimensional face alignment result comprises:
extracting three-dimensional face features based on heat maps: extracting feature maps of the two-dimensional key points with the first two layers of a DenseNet structure, and integrating the features into 69 key point heat maps, of which 68 are key points and 1 is the background;
performing feature decoding with a stackable residual structure based on an attention mechanism.
During continuous gaze interaction, obtaining the iris data of the human eyes from a denoised, edge-enhanced image using the Hough transform, and determining the eye-movement reference vector from the iris data and the face base vector to complete gaze tracking, comprises:
during continuous gaze interaction, locating the dynamic iris center of the eye-movement vector with an iris recognition method based on scale-fixed Hough features, starting from the face reference vector;
selecting the coordinates of key points with stable characteristics among the face key points as reference coordinates, and calculating the iris center coordinates of both eyes and the key features of the three-dimensional spatial information of the face reference vector;
completing eye-movement-vector gaze tracking with a gaze tracking regression function according to the key features of the three-dimensional spatial information of the face reference vector.
A single-camera multi-mode gaze tracking device, wherein the device comprises:
an alignment module for aligning face information acquired by a camera using an attention-enhanced adaptive three-dimensional face alignment method to obtain a three-dimensional face alignment result;
a calibration module for calibrating face reference features and solving a face reference vector for the gaze tracking process;
a computing module for calculating the normal vectors of the triangular faces of several key points on the face mesh from the three-dimensional face alignment result;
a fitting module for representing a reference projection vector starting from the three-dimensional face center and fitting a gaze transfer matrix to obtain a dynamic reference vector of the freely moving head in three-dimensional space;
an establishing module for calibrating user characteristics according to the interaction habits of different users and establishing specific information for the user to be tracked;
a tracking module for obtaining iris data of the human eyes from a denoised, edge-enhanced image using the Hough transform during continuous gaze interaction, and determining an eye-movement reference vector from the iris data and the face base vector to complete gaze tracking.
Wherein the alignment module comprises:
a first sub-module for determining an adaptive face detection method and acquiring face information with the determined adaptive face detection method;
a second sub-module for extracting multi-scale image receptive fields with multi-scale convolution kernels from the face information acquired by the camera, and extracting face features from the face information with deformable convolution to determine the deformed face features;
a third sub-module for introducing an attention mechanism into the three-dimensional face key point recognition of the face information acquired by the camera to obtain the three-dimensional face alignment result.
When determining the adaptive face detection method, the first sub-module is specifically configured to:
remove irrelevant feature points from the face information acquired by the camera using two pooling operations and cascaded rectified linear units;
acquire feature information of different structures using Inception feature modules as the backbone network;
perform shape adaptation: after the backbone network, extract features of different scales through several different convolution kernels using a single-shot multi-box prediction method.
Wherein the third sub-module is specifically configured to:
extract three-dimensional face features based on heat maps: extract feature maps of the two-dimensional key points with the first two layers of a DenseNet structure, and integrate the features into 69 key point heat maps, of which 68 are key points and 1 is the background;
perform feature decoding with a stackable residual structure based on an attention mechanism.
Wherein the tracking module comprises:
a fourth sub-module for locating the dynamic iris center of the eye-movement vector during continuous gaze interaction with an iris recognition method based on scale-fixed Hough features, starting from the face reference vector;
a fifth sub-module for selecting the coordinates of key points with stable characteristics among the face key points as reference coordinates, and calculating the iris center coordinates of both eyes and the key features of the three-dimensional spatial information of the face reference vector;
a sixth sub-module for completing eye-movement-vector gaze tracking with a gaze tracking regression function according to the key features of the three-dimensional spatial information of the face reference vector.
An embodiment of the invention provides an electronic device comprising a processor, a memory, and a program or instructions stored in the memory and executable on the processor, where the program or instructions, when executed by the processor, implement the steps of any of the above single-camera multi-mode gaze tracking methods.
An embodiment of the invention provides a readable storage medium storing a program or instructions which, when executed by a processor, implement the steps of any of the above single-camera multi-mode gaze tracking methods.
According to the single-camera multi-mode gaze tracking method provided by embodiments of the invention, face information acquired by a camera is aligned by an attention-enhanced adaptive three-dimensional face alignment method to obtain a three-dimensional face alignment result; face reference features are calibrated and a face reference vector is solved for the gaze tracking process; the normal vectors of the triangular faces of several key points on the face mesh are calculated from the three-dimensional face alignment result; a reference projection vector is represented starting from the three-dimensional face center, and a gaze transfer matrix is fitted to obtain a dynamic reference vector of the freely moving head in three-dimensional space; user characteristics are calibrated according to the interaction habits of different users, and specific information is established for the user to be tracked; during continuous gaze interaction, iris data of the human eyes are obtained from a denoised, edge-enhanced image using the Hough transform; and an eye-movement reference vector is determined from the iris data and the face base vector to complete gaze tracking. On the one hand, the adaptive-scale face detection method and the single-camera multi-mode gaze tracking method solve the problems of equipment dependence and environment dependence, improve the accuracy and real-time performance of gaze tracking, and reduce the cost of the gaze interaction mode; on the other hand, the method has a wide application range: it can be applied both to simple environments and to environments of higher complexity, and it handles partial occlusion, distortion, and deformation of the face.
Drawings
FIG. 1 is a flowchart illustrating the steps of a single-camera multi-mode gaze tracking method according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating the steps of a gaze tracking method according to an embodiment of the present invention;
FIG. 3 is a network diagram of the attention-enhanced adaptive three-dimensional face alignment method according to an embodiment of the present invention;
FIG. 4 is a block diagram of the face-reference-based rapid gaze tracking method for a user according to an embodiment of the present invention;
FIG. 5 is a block diagram of the spatially-defined eye-movement-vector gaze tracking method according to an embodiment of the present invention;
FIG. 6 is a flowchart of the scale Hough transform method according to an embodiment of the present invention;
FIG. 7 is a diagram illustrating the eye-movement vector determination process according to an embodiment of the present invention;
FIG. 8 is a diagram of iris localization results according to an embodiment of the present invention;
FIG. 9 is a block diagram showing the configuration of a single-camera multi-mode gaze tracking device according to an embodiment of the present application;
FIG. 10 is a block diagram showing the configuration of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.
The single-camera multi-mode gaze tracking scheme provided by the embodiments of the present application is described in detail below with reference to the accompanying drawings through specific embodiments and application scenarios thereof.
Fig. 1 is a flowchart illustrating steps of a single-camera multi-mode gaze tracking method according to an embodiment of the present application.
The single-camera multi-mode gaze tracking method comprises the following steps:
Step 101: align the face information acquired by the camera using an attention-enhanced adaptive three-dimensional face alignment method to obtain a three-dimensional face alignment result.
Optionally, aligning the face information acquired by the camera with the attention-enhanced adaptive three-dimensional face alignment method to obtain the three-dimensional face alignment result comprises the following substeps:
First substep: determine an adaptive face detection method, and acquire face information with the determined adaptive face detection method.
Specifically, the adaptive face detection method is determined as follows:
First, irrelevant feature points in the face information acquired by the camera are removed using two pooling operations and cascaded rectified linear units;
second, feature information of different structures is acquired using Inception feature modules as the backbone network;
finally, shape adaptation is performed: after the backbone network, features of different scales are extracted through several different convolution kernels using a single-shot multi-box prediction method.
Second substep: for the face information acquired by the camera, extract multi-scale image receptive fields with multi-scale convolution kernels, and extract face features from the face information with deformable convolution to determine the deformed face features.
While the multi-scale convolution kernels extract multi-scale image receptive fields, deformable convolution further extracts the face features and regresses face features that may be deformed, improving face detection accuracy.
Third substep: introduce an attention mechanism into the three-dimensional face key point recognition of the face information acquired by the camera to obtain the three-dimensional face alignment result.
For the recognition of three-dimensional face key points, UV position map features keep the number of computed parameters low; to compensate for weak image feature extraction, an attention mechanism is introduced into the feature decoding process to strengthen the effect of the image texture features in the UV map.
Specifically, introducing an attention mechanism into the three-dimensional face key point recognition of the face information acquired by the camera to obtain the three-dimensional face alignment result may comprise the following steps:
First, three-dimensional face features are extracted based on heat maps: feature maps of the two-dimensional key points are extracted with the first two layers of a DenseNet structure, and the features are integrated into 69 key point heat maps, of which 68 are key points and 1 is the background;
second, feature decoding is performed with a stackable residual structure based on an attention mechanism.
Step 102: calibrate the face reference features and solve the face reference vector for the gaze tracking process.
Step 103: calculate the normal vectors of the triangular faces of several key points on the face mesh from the three-dimensional face alignment result.
Step 104: represent a reference projection vector starting from the three-dimensional face center, and fit the gaze transfer matrix to obtain a dynamic reference vector of the freely moving head in three-dimensional space.
Step 105: calibrate user characteristics according to the interaction habits of different users, and establish specific information for the user to be tracked.
When establishing the specific information of the gaze tracking user, the user can be asked to calibrate at a position 30 cm from the computer screen (face parallel to the screen), with the calibration point at the center of the screen. Initial calibration data are obtained with the key point detection method provided above; the remaining calibration information can be gradually derived by imaging, and a spatial matrix is computed from the calibration result to achieve real-time alignment.
Steps 102 to 105 construct a face-reference-based rapid gaze tracking method from the three-dimensional face alignment result and a single-point feature calibration scheme, and establish the feature mapping relationship; this method sustains short-term, frequent gaze tracking operations.
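As an illustration of steps 103 and 104, the sketch below computes the face reference vector as the normalized sum of the triangle-face normals of the mesh key points. It is a minimal sketch assuming the alignment result provides 3-D key point coordinates and triangle index triples; all names are illustrative rather than the patent's implementation.

import numpy as np

def face_reference_vector(points: np.ndarray, triangles: np.ndarray) -> np.ndarray:
    """Sum the normals of the triangular faces spanned by mesh key points.

    points:    (K, 3) 3-D key point coordinates from the face alignment result.
    triangles: (T, 3) integer index triples selecting the key points of each face.
    Returns the unit face reference vector.
    """
    a = points[triangles[:, 0]]
    b = points[triangles[:, 1]]
    c = points[triangles[:, 2]]
    normals = np.cross(b - a, c - a)   # one normal per triangular face
    ref = normals.sum(axis=0)          # sum of the face normals
    return ref / np.linalg.norm(ref)   # normalize to a direction

# The reference projection vector starts at the three-dimensional face center,
# e.g. center = points.mean(axis=0); the gaze transfer matrix is then fitted
# against this dynamic reference vector as the head moves.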
In an optional embodiment, after the specific information of the user to be tracked is established, the existing error can be corrected.
When correcting the error, the deviations of the user's head caused by motion parallel to the screen, motion perpendicular to the screen, and left-right rotation can be derived. For the offset caused by head rotation, a control factor α is introduced with α = 0.1; the coordinates of the intersection of the user's head orientation vector with the screen are scaled by α, i.e. D(X_d) becomes αD(X_d), and the offset parameter is set in the vertical direction.
Step 106: during continuous gaze interaction, obtain iris data of the human eyes from a denoised, edge-enhanced image using the Hough transform; and determine an eye-movement reference vector from the iris data and the face base vector to complete gaze tracking.
In an optional embodiment, obtaining the iris data of the human eyes from a denoised, edge-enhanced image with the Hough transform during continuous gaze interaction, and determining the eye-movement reference vector from the iris data and the face base vector to complete gaze tracking, comprises the following substeps:
First substep: during continuous gaze interaction, locate the dynamic iris center of the eye-movement vector with an iris recognition method based on scale-fixed Hough features, starting from the face reference vector.
The iris recognition method based on scale-fixed Hough features uses the three-dimensional face alignment method to obtain the eye boxes and crop out the positions of both eyes; each eye image is processed independently, as follows (a code sketch of this pipeline follows the list):
First, human behavioral noise is rejected by thresholding the eye aspect ratio, i.e. judging whether the eye is in the middle of a blink.
Second, a Gaussian filter is applied to the screened pictures to remove interference.
Third, the edge features of the image are enhanced using the normally distributed Gaussian filtering result; the gradient direction of the image is computed and the gradient changes are strengthened. Non-maximum suppression then filters out non-maximal gradient features: edges above the gradient threshold are kept, and the gray level of edges below it is set to 0.
Finally, edges are detected with a double-threshold method.
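A minimal sketch of this denoise-then-detect pipeline, assuming OpenCV is available; the Gaussian kernel size, the thresholds, and the fixed radius range are illustrative assumptions, and the gradient computation, non-maximum suppression, and double thresholding happen inside the Canny stage that cv2.HoughCircles runs internally.

import cv2
import numpy as np

def locate_iris_center(eye_img: np.ndarray, r_min: int = 8, r_max: int = 14):
    """Locate the iris center in a cropped grayscale eye image with a
    fixed-scale Hough circle search."""
    # Gaussian filtering removes interference before edge extraction.
    blurred = cv2.GaussianBlur(eye_img, (5, 5), 0)
    # HoughCircles applies Canny internally (gradient + NMS + double threshold);
    # param1 is the upper Canny threshold, param2 the accumulator threshold.
    circles = cv2.HoughCircles(blurred, cv2.HOUGH_GRADIENT, dp=1,
                               minDist=eye_img.shape[1],   # expect one iris per crop
                               param1=80, param2=15,
                               minRadius=r_min, maxRadius=r_max)  # fixed scale
    if circles is None:
        return None
    x, y, _r = circles[0, 0]
    return float(x), float(y)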
Second substep: select the coordinates of key points with stable characteristics among the face key points as reference coordinates, and calculate the iris center coordinates of both eyes and the key features of the three-dimensional spatial information of the face reference vector.
Specifically, the iris center coordinates of both eyes and the key features of the three-dimensional spatial information of the face reference vector are calculated as follows:
First, 33 dense points with clear and fixed structural features in the face (including the eye corners) are used as the reference point set of the eye-movement vectors; the 33 three-dimensional points are obtained by applying the alignment method of step 101 to the images acquired by the single camera.
Second, the occlusion problem is addressed: missing features make the information of one eye abnormal, or make the information of part of the face symmetrically abnormal. The final eye-movement vector is the result of integrating the two eye-movement vectors, and the integrated binocular vector is constrained by a defined threshold.
The average depth coordinate of the eye contour is used as the iris depth coordinate.
Third substep: complete eye-movement-vector gaze tracking with a gaze tracking regression function according to the key features of the three-dimensional spatial information of the face reference vector.
In the third substep, eye-movement-vector gaze tracking is completed with a gaze tracking regression function from the key features of the three-dimensional spatial information of the face reference vector. The initial features are calibrated with a 9-point calibration method; a regression function that mixes the primary and secondary eye-movement vectors with the head reference is used, and the depth vector of the iris regresses the spatial influence.
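A sketch of such a mixed regression is given below, assuming a ridge-regularized linear model over concatenated eye-movement and head-reference features; the feature layout and the regularization are illustrative assumptions, not the patent's regression function.

import numpy as np

def fit_gaze_regressor(eye_vecs, head_vecs, screen_pts, lam=1e-3):
    """Fit screen coordinates from mixed eye-movement and head-reference features.

    eye_vecs:   (N, 3) integrated binocular eye-movement vectors.
    head_vecs:  (N, 3) head (face reference) vectors.
    screen_pts: (N, 2) screen points from the 9-point calibration.
    """
    X = np.hstack([eye_vecs, head_vecs, np.ones((len(eye_vecs), 1))])
    # Ridge-regularized normal equations: (X^T X + lam * I) W = X^T Y.
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ screen_pts)

def predict_gaze(W, eye_vec, head_vec):
    x = np.concatenate([eye_vec, head_vec, [1.0]])
    return x @ W  # (sx, sy) on the screen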
According to the single-camera multi-mode gaze tracking method provided by the embodiments of the present application, on the one hand, the adaptive-scale face detection method and the single-camera multi-mode gaze tracking method solve the problems of equipment dependence and environment dependence, improve the accuracy and real-time performance of gaze tracking, and reduce the cost of the gaze interaction mode; on the other hand, the method has a wide application range: it can be applied both to simple environments and to environments of higher complexity, and it handles partial occlusion, distortion, and deformation of the face.
FIG. 2 is a flowchart illustrating the steps of a single-camera multi-mode gaze tracking method according to an embodiment of the present invention; the method comprises the following steps:
Step (1): align the face information acquired by the single camera using the attention-enhanced adaptive three-dimensional face alignment method.
Step (2): construct a face-reference-based rapid gaze tracking method from the three-dimensional face alignment result and a single-point feature calibration scheme, and establish the feature mapping relationship to sustain short-term, frequent gaze tracking operations.
Step (3): for continuous gaze interaction behavior, construct an eye-movement-vector gaze tracking method on the face reference features: obtain the iris data of the eyes from a denoised, edge-enhanced image using the Hough transform, and combine them with the face base vector to form the eye-movement reference vector and complete gaze tracking.
FIG. 3 is a network diagram of the attention-enhanced adaptive three-dimensional face alignment method according to an embodiment of the present invention.
As further shown in FIG. 3, step (1), in which the attention-enhanced adaptive three-dimensional face alignment method aligns the face information acquired by the single camera, comprises the following specific steps (1-1) to (1-3).
Step (1-1): reduce the recognition time of the face detection box with a single-stage object detection method, yielding an adaptive face detection method.
Step (1-2): while the multi-scale convolution kernels extract multi-scale image receptive fields, further extract the face features with deformable convolution and regress face features that may be deformed, improving face detection accuracy.
Step (1-3): for the recognition of three-dimensional face key points, keep the number of computed parameters low with UV position map features; to compensate for weak image feature extraction, introduce an attention mechanism into the feature decoding process and strengthen the effect of the image texture features in the UV map.
Further, in step (1-1), reducing the recognition time of the face detection box with a single-stage object detection method and providing an adaptive face detection method comprises the following specific steps (1-1-1) to (1-1-2):
Step (1-1-1): feature extraction. The network initially uses one 7 × 7 and two 5 × 5 large convolution kernels to obtain a large receptive field; each convolution uses a sliding stride of 2 to shrink the feature space, and two pooling operations eliminate irrelevant feature points. Inception feature modules serve as the backbone network to obtain feature information of different structures; convolution scales of 1 × 1, 3 × 3, 5 × 5, 7 × 7, and so on are set, a path representing the 7 × 7 module is added to the basic Inception v2 model, and richer receptive fields are obtained by fusing features of different layers.
Table 1 (the feature extraction network) is given as an image in the original publication and is not reproduced here.
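A minimal PyTorch-style sketch of the multi-branch feature module described above, assuming an Inception-v2-like block extended with a 7 × 7 path; the channel counts are illustrative assumptions, not the values of Table 1.

import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """Inception-style block with 1x1, 3x3, 5x5 branches and an added 7x7 path."""

    def __init__(self, in_ch: int, branch_ch: int = 32):
        super().__init__()
        def branch(k: int) -> nn.Sequential:
            # 1x1 reduction followed by a kxk convolution; padding keeps the size.
            return nn.Sequential(
                nn.Conv2d(in_ch, branch_ch, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(branch_ch, branch_ch, k, padding=k // 2),
                nn.ReLU(inplace=True),
            )
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, branch_ch, 1), nn.ReLU(inplace=True))
        self.b3, self.b5, self.b7 = branch(3), branch(5), branch(7)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Fuse the receptive fields of the different kernel scales by concatenation.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.b7(x)], dim=1)

# x = torch.randn(1, 64, 56, 56); y = MultiScaleBlock(64)(x)  # y: (1, 128, 56, 56)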
Step (1-1-2): shape adaptation. One approach reduces training difficulty through prior boxes of different scales: different default scales are used on different feature layers, different aspect ratios form feature maps of different scales, and training selects the prediction box closest to a fixed scale. The formula for the default box size of each feature map is given as an image in the original publication.
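The formula image is not recoverable from this copy; for orientation, the standard default-box scale used by single-shot multi-box detectors, which the surrounding text appears to describe, is (an assumption, not necessarily the patent's exact form):

s_k = s_{\min} + \frac{s_{\max} - s_{\min}}{m - 1}\,(k - 1), \qquad k \in \{1, \dots, m\},

where m is the number of feature maps and s_{\min}, s_{\max} bound the box scales.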
The other approach achieves prediction at different scales by strengthening the network's adaptability to shape. The feature map output by the backbone network first undergoes feature integration with a 1 × 1 convolution; then an offset added to each convolution sampling point expresses the relationship between each sampling point and its surroundings, capturing features at deformed scales. The input feature map is first passed through an ordinary convolution to obtain the pixel offsets; each offset is added to the index of the original feature map, moving a point P_0 on the original feature map to position P_l. Since the position coordinates of P_l come from a convolution result, they are fractional, and the four surrounding integer coordinates of P_l are
[[f(x), f(y)], [f(x), e(y)], [e(x), f(y)], [e(x), e(y)]],
where f(·) is the floor and e(·) the ceiling function.
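The value at the fractional position P_l is then recovered by bilinear interpolation over these four integer neighbors, as is standard for deformable convolution (the weighting below is the standard form, assumed here):

x(P_l) = \sum_{q} \big(1 - |P_{l,x} - q_x|\big)\big(1 - |P_{l,y} - q_y|\big)\, x(q),

where q ranges over the four coordinates listed above.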
Further, step (1-3) adopts UV position map features and, for the case of weak image feature extraction, introduces an attention mechanism into the feature decoding process, comprising the following specific steps (1-3-1) to (1-3-2):
Step (1-3-1): extract three-dimensional face features based on heat maps: extract the feature maps of the two-dimensional key points with the first two layers of a DenseNet structure, and integrate the features into 69 key point heat maps (68 key points plus 1 background).
Table 2 (the heat map feature extraction network) is given as an image in the original publication.
Step (1-3-2): face alignment based on attention decoding: perform feature decoding with a stackable residual structure based on an attention mechanism.
Table 3 (the PRNet decoder structure) is given as an image in the original publication.
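A compact sketch of the two pieces just described, assuming a 69-channel heat-map head and a residual block gated by a learned spatial attention map; the exact layer sizes are illustrative assumptions, not the contents of Tables 2 and 3.

import torch
import torch.nn as nn

class HeatmapHead(nn.Module):
    """Project backbone features to 69 key point heat maps (68 points + background)."""

    def __init__(self, in_ch: int, n_maps: int = 69):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, n_maps, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)  # (B, 69, H, W)

class AttentionResidualBlock(nn.Module):
    """Stackable residual block whose output is modulated by spatial attention."""

    def __init__(self, ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
        )
        # A one-channel attention map highlights texture regions of the UV map.
        self.attn = nn.Sequential(nn.Conv2d(ch, 1, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.body(x)
        return x + h * self.attn(h)  # residual connection with attention gating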
FIG. 4 is a block diagram of the face-reference-based rapid gaze tracking method for a user according to an embodiment of the present invention.
further, as shown in fig. 4, in the step (2), a rapid eye tracking method based on a face reference is constructed by using a three-dimensional face alignment result and a single-point feature calibration mode, and a feature mapping relationship is established to ensure a short-time frequent eye tracking operation, and the method comprises the following specific steps:
step (2-1): firstly, calibrating the human face reference characteristics, solving a specific human face reference vector in the process of tracing a sight line, calculating the sum of normal vectors of triangular surfaces of a plurality of key points on a face grid by adopting a three-dimensional human face alignment result, representing a reference projection vector by using a three-dimensional human face center as a starting point, fitting a sight line transfer matrix, deriving a dynamic reference vector of a flexible head in a three-dimensional space, calibrating the characteristics of a user according to the interaction habits of different users, and establishing the specific information of the sight line tracing user.
Step (2-2): and secondly to correct for errors that are present.
Further, the step (2-1) sets the user to perform calibration at a position 30cm away from the computer screen (the face is parallel to the screen) aiming at the information specific to the resume sight tracking user, and the calibration point is the center of the screen. The initial calibration data is obtained by the key point detection method provided in the step (1), the rest calibration information can be gradually deduced in an imaging mode, and a spatial matrix is calculated to obtain a calibration result so as to realize real-time alignment.
Further, the step (2-2) corrects the error, which is derived for the deviation of the human head in the motion parallel to the screen, the motion perpendicular to the screen, and the left-right rotation. A control factor alpha is introduced to the offset problem caused by head rotation, alpha is 0.1, the coordinates of the intersection point of the head orientation vector of the user on the screen are controlled through alpha, namely D (X _ D) is changed into alpha D (X _ D), and the offset parameter is set in the vertical direction.
FIG. 5 is a block diagram of the spatially-defined eye-movement-vector gaze tracking method according to an embodiment of the present invention.
Further, as shown in FIG. 5, step (3) constructs the eye-movement-vector gaze tracking method on the face reference features for continuous gaze interaction behavior: the iris data of the eyes are obtained from a denoised, edge-enhanced image using the Hough transform and combined with the face base vector to form the eye-movement reference vector and complete gaze tracking, with the following specific steps:
Step (3-1): locate the dynamic iris center of the eye-movement vector with the iris recognition method based on scale-fixed Hough features, starting from the face reference vector obtained in the previous step.
A flowchart of the scale Hough transform method is shown in FIG. 6.
Step (3-2): then select the coordinates of key points with stable characteristics among the face key points, such as the face contour, nose tip, and eye contours, as reference coordinates, and calculate the iris center coordinates of both eyes (the average depth coordinate of the eye contour serves as the iris depth coordinate) and the key features of the three-dimensional spatial information of the face reference vector.
Step (3-3): complete eye-movement-vector gaze tracking with a gaze tracking regression function according to the key features of the three-dimensional spatial information of the face reference vector.
Further, the iris recognition method based on scale-fixed Hough features in step (3-1) uses the three-dimensional face alignment method of step (1) to obtain the eye boxes and crop out the positions of both eyes; each eye image is processed independently, with the following specific steps:
step (3-1-1): judging the behavioral noise of the human: the human eye contour may be represented by P1,P2,P3,P4,P5,P6It is shown that the Ratio of the human Eye edge, i.e. the Aspect Ratio (EAR), can be calculated from these 6 points:
Figure BDA0003482967310000131
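The standard eye aspect ratio over six contour points, assumed here to match the patent's image formula, is:

\mathrm{EAR} = \frac{\lVert P_2 - P_6 \rVert + \lVert P_3 - P_5 \rVert}{2\,\lVert P_1 - P_4 \rVert}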
In the embodiment of the invention, the threshold is set to 0.2; when the EAR is below 0.2, the eye is considered to be in a blink phase.
Step (3-1-2): remove interference with a Gaussian filter. For any point P(x, y) on the picture with gray value f(x, y), the filtered value is given by a formula rendered as an image in the original publication.
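The standard 2-D Gaussian smoothing that this step describes (assumed form) is:

g(x, y) = \sum_{u}\sum_{v} f(x - u,\, y - v)\,
          \frac{1}{2\pi\sigma^2} \exp\!\left(-\frac{u^2 + v^2}{2\sigma^2}\right)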
step (3-1-3): enhancing the edge characteristics: and enhancing the edge characteristics of the image by using the result of Gaussian filtering in a normal distribution form, calculating the gradient change direction of the image, and enhancing the gradient change.
Step (3-1-4): filtering the non-maximal gradient features: and multiplying each point by a sobel operator to obtain the gray value change degree, namely the gradient change result characterization edge. However, the edge amplification still exists in the filtering process, non-maximum gradient features are further filtered by adopting a non-maximum suppression mode, edges larger than the gradient value are reserved, and the gray level of the edges smaller than the gradient value is set to be 0.
Step (3-1-5): detecting edges: and carrying out double-threshold detection on the edge feature map, removing features except for the lower threshold and the upper threshold, updating the intermediate data to be 1, and emphasizing the eye features.
Further, step (3-2) calculates the iris center coordinates of both eyes and the key features of the three-dimensional spatial information of the face reference vector with the following specific steps:
Step (3-2-1): reference point set. The 33 dense points with clear and fixed structural features in the face (including the eye corners) serve as the reference point set of the eye-movement vectors; the 33 three-dimensional points are obtained by applying the alignment method of step (1) to the images acquired by the single camera. The formula for the coordinates of the final reference point is given as an image in the original publication.
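A natural reading of that formula, stated here as an assumption, is the centroid of the 33 reference points:

P_{\mathrm{ref}} = \frac{1}{33} \sum_{i=1}^{33} P_i, \qquad P_i \in \mathbb{R}^3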
step (3-2-2): integrating the vectors: the method aims at the problem that the missing features cause information aberration of one eye or information symmetry aberration of a part of face due to the occlusion problem. The final eye movement vector is the result of integrating the two eye movement vectors, and the integrated vector of the binocular eye movement vectors is limited in a manner of defining a threshold, and the final eye movement vector can be expressed as:
Figure BDA0003482967310000151
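One plausible form of this thresholded integration, stated as an assumption rather than the patent's exact formula, averages the two eye-movement vectors when they agree and falls back to the unoccluded eye otherwise:

v_{\mathrm{eye}} =
\begin{cases}
\tfrac{1}{2}\,(v_L + v_R), & \lVert v_L - v_R \rVert \le \tau, \\
v_{\text{unoccluded eye}}, & \text{otherwise.}
\end{cases}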
and further, completing eye movement vector sight tracking by adopting a sight tracking regression function according to key features of the three-dimensional space information of the face reference vector in the step (3-3). The initial features are calibrated by using a 9-point calibration method, a regression function of mixing the primary and secondary eye movement vectors and the head reference is used, and the depth vector regression space influence of the iris is adopted.
The eye-movement vector determination process is shown in FIG. 7, and the iris localization result is shown in FIG. 8.
According to the single-camera multi-mode gaze tracking method provided by the embodiments of the invention, the adaptive-scale face detection method and the single-camera multi-mode gaze tracking method solve the problems of equipment dependence and environment dependence, improve the accuracy and real-time performance of gaze tracking, and reduce the cost of the gaze interaction mode.
Fig. 9 is a block diagram of a single-camera multi-mode gaze tracking device according to an embodiment of the present application.
The single-camera multi-mode gaze tracking device comprises the following functional modules:
an alignment module 201 for aligning face information acquired by a camera using an attention-enhanced adaptive three-dimensional face alignment method to obtain a three-dimensional face alignment result;
a calibration module 202 for calibrating face reference features and solving a face reference vector for the gaze tracking process;
a calculation module 203 for calculating the normal vectors of the triangular faces of several key points on the face mesh from the three-dimensional face alignment result;
a fitting module 204 for representing a reference projection vector starting from the three-dimensional face center and fitting the gaze transfer matrix to obtain a dynamic reference vector of the freely moving head in three-dimensional space;
an establishing module 205 for calibrating user characteristics according to the interaction habits of different users and establishing specific information for the user to be tracked;
a tracking module 206 for obtaining iris data of the human eyes from a denoised, edge-enhanced image using the Hough transform during continuous gaze interaction, and determining an eye-movement reference vector from the iris data and the face base vector to complete gaze tracking.
Optionally, the alignment module comprises:
a first sub-module for determining an adaptive face detection method and acquiring face information with the determined adaptive face detection method;
a second sub-module for extracting multi-scale image receptive fields with multi-scale convolution kernels from the face information acquired by the camera, and extracting face features from the face information with deformable convolution to determine the deformed face features;
a third sub-module for introducing an attention mechanism into the three-dimensional face key point recognition of the face information acquired by the camera to obtain the three-dimensional face alignment result.
Optionally, when determining the adaptive face detection method, the first sub-module is specifically configured to:
remove irrelevant feature points from the face information acquired by the camera using two pooling operations and cascaded rectified linear units;
acquire feature information of different structures using Inception feature modules as the backbone network;
perform shape adaptation: after the backbone network, extract features of different scales through several different convolution kernels using a single-shot multi-box prediction method.
Optionally, the third sub-module is specifically configured to:
extract three-dimensional face features based on heat maps: extract feature maps of the two-dimensional key points with the first two layers of a DenseNet structure, and integrate the features into 69 key point heat maps, of which 68 are key points and 1 is the background;
perform feature decoding with a stackable residual structure based on an attention mechanism.
Optionally, the tracking module comprises:
a fourth sub-module for locating the dynamic iris center of the eye-movement vector during continuous gaze interaction with an iris recognition method based on scale-fixed Hough features, starting from the face reference vector;
a fifth sub-module for selecting the coordinates of key points with stable characteristics among the face key points as reference coordinates, and calculating the iris center coordinates of both eyes and the key features of the three-dimensional spatial information of the face reference vector;
a sixth sub-module for completing eye-movement-vector gaze tracking with a gaze tracking regression function according to the key features of the three-dimensional spatial information of the face reference vector.
The embodiment of the present application thus provides a single-camera multi-mode gaze tracking device.
In the embodiments of the present application, the single-camera multi-mode gaze tracking device shown in FIG. 9 may be a standalone device, or a component, integrated circuit, or chip in a server. The device shown in FIG. 9 may be a device with an operating system, such as the Android operating system, the iOS operating system, or another possible operating system; the embodiments of the present application do not specifically limit this.
The single-camera multi-mode gaze tracking device shown in FIG. 9 can implement each process implemented by the method embodiment of FIG. 1; this is not repeated here to avoid repetition.
Optionally, as shown in FIG. 10, an embodiment of the present application further provides an electronic device 300 comprising a processor 301, a memory 302, and a program or instructions stored in the memory 302 and executable on the processor 301. When executed by the processor 301, the program or instructions implement each process of the above single-camera multi-mode gaze tracking method embodiment and achieve the same technical effect; this is not repeated here to avoid repetition.
It should be noted that the electronic device in the embodiments of the present application includes the server described above.
An embodiment of the present application further provides a readable storage medium storing a program or instructions which, when executed by a processor, implement each process of the single-camera multi-mode gaze tracking method embodiment and achieve the same technical effect; this is not repeated here to avoid repetition.
The processor is the processor in the electronic device described in the above embodiment. The readable storage medium includes a computer-readable storage medium, such as a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
An embodiment of the present application further provides a chip comprising a processor and a communication interface coupled to the processor, where the processor is configured to run a program or instructions implementing each process of the above single-camera multi-mode gaze tracking method embodiment with the same technical effect; this is not repeated here to avoid repetition.
It should be understood that the chips mentioned in the embodiments of the present application may also be referred to as system-on-chip or system-on-a-chip.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. A single-camera multi-mode gaze tracking method, wherein the method comprises:
aligning face information acquired by a camera using an attention-enhanced adaptive three-dimensional face alignment method to obtain a three-dimensional face alignment result;
calibrating face reference features and solving a face reference vector for the gaze tracking process;
calculating the normal vectors of the triangular faces of several key points on the face mesh from the three-dimensional face alignment result;
representing a reference projection vector starting from the three-dimensional face center, and fitting a gaze transfer matrix to obtain a dynamic reference vector of the freely moving head in three-dimensional space;
calibrating user characteristics according to the interaction habits of different users, and establishing specific information for the user to be tracked;
during continuous gaze interaction, obtaining iris data of the human eyes from a denoised, edge-enhanced image using the Hough transform; and determining an eye-movement reference vector from the iris data and the face base vector to complete gaze tracking.
2. The method according to claim 1, wherein aligning the face information acquired by the camera using the attention-enhanced adaptive three-dimensional face alignment method to obtain the three-dimensional face alignment result comprises:
determining an adaptive face detection method, and acquiring face information using the determined adaptive face detection method;
for the face information acquired by the camera, extracting multi-scale image receptive fields through multi-scale convolution kernels, and extracting face features from the face information using deformable convolution to determine the deformation-adapted face features (sketched after this claim);
and introducing an attention mechanism into the three-dimensional face key-point recognition performed on the face information acquired by the camera, to obtain the three-dimensional face alignment result.
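A minimal PyTorch sketch of the multi-scale and deformable-convolution step follows. Everything in it is assumed for illustration: the parallel 3/5/7 kernels standing in for the multi-scale receptive field, the branch widths, and the use of torchvision's DeformConv2d as the deformable operator, none of which are specified by the claim.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class MultiScaleDeformBlock(nn.Module):
    """Illustrative block: multi-scale kernels widen the receptive field,
    then a deformable convolution adapts sampling to the face shape."""

    def __init__(self, in_ch: int = 3, out_ch: int = 32):
        super().__init__()
        # Parallel kernels of different sizes approximate a multi-scale receptive field.
        self.scales = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, k, padding=k // 2) for k in (3, 5, 7)]
        )
        # A plain convolution predicts the 2 * 3 * 3 sampling offsets per location.
        self.offset = nn.Conv2d(3 * out_ch, 2 * 3 * 3, kernel_size=3, padding=1)
        self.deform = DeformConv2d(3 * out_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        feats = torch.cat([branch(x) for branch in self.scales], dim=1)
        return self.deform(feats, self.offset(feats))

# e.g. MultiScaleDeformBlock()(torch.randn(1, 3, 128, 128)).shape -> (1, 32, 128, 128)
```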
3. The method of claim 2, wherein the step of determining an adaptive face detection method comprises:
removing irrelevant feature points from the face information acquired by the camera using two pooling operations and cascaded rectified linear units;
acquiring feature information of different structures by using an Inception feature module as the backbone network;
and performing shape adaptation: after the backbone network, extracting features of different scales through a plurality of different convolution kernels using a single-shot multi-box prediction method (see the sketch following this claim).
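As a rough illustration of the Inception-style backbone unit this claim names, here is a minimal multi-branch module; the branch layout and channel widths are generic Inception conventions rather than values from the patent, and the single-shot multi-scale prediction heads that would follow the backbone are omitted.

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Parallel branches capture feature information of different structures;
    their outputs are concatenated along the channel dimension."""

    def __init__(self, in_ch: int):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 16, 1)
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 16, 1),
                                nn.Conv2d(16, 24, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, 16, 1),
                                nn.Conv2d(16, 24, 5, padding=2))
        self.pool = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                  nn.Conv2d(in_ch, 16, 1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.pool(x)], dim=1)

# e.g. InceptionBlock(3)(torch.randn(1, 3, 64, 64)).shape -> (1, 80, 64, 64)
```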
4. The method according to claim 2, wherein introducing the attention mechanism into the three-dimensional face key-point recognition performed on the face information acquired by the camera to obtain the three-dimensional face alignment result comprises:
extracting heatmap-based three-dimensional face features: extracting feature maps of two-dimensional key points using the first two layers of a DenseNet, and integrating the features into 69 key-point heatmaps, of which 68 channels correspond to face key points and 1 channel to the background (decoded in the sketch after this claim);
and performing feature decoding using a stackable attention-based residual structure.
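Decoding the 69-channel heatmap stack into landmark coordinates reduces to a per-channel argmax once the background channel is discarded. This generic sketch assumes only the claim's 68-plus-background layout; sub-pixel refinement and the attention-based residual decoder itself are beyond its scope.

```python
import torch

def decode_heatmaps(heatmaps: torch.Tensor) -> torch.Tensor:
    """Turn a (69, H, W) stack -- 68 key-point channels plus 1 background
    channel -- into (68, 2) integer (x, y) peak coordinates."""
    kp = heatmaps[:68]                      # drop the background channel
    c, _, w = kp.shape
    flat = kp.reshape(c, -1).argmax(dim=1)  # peak index per key-point channel
    ys = torch.div(flat, w, rounding_mode="floor")
    xs = flat % w
    return torch.stack([xs, ys], dim=1)

# e.g. decode_heatmaps(torch.rand(69, 64, 64)).shape -> (68, 2)
```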
5. The method according to claim 1, wherein, during continuous gaze interaction, acquiring the human-eye iris data from the denoised, edge-enhanced image using the Hough transform, and determining the eye-movement reference vector from the iris data and the face reference vector to complete gaze tracking, comprises:
during continuous gaze interaction, locating the dynamic iris center of the eye-movement vector based on the face reference vector, using an iris recognition method built on scale-fixed Hough features;
selecting, from the face key points, coordinates with stable characteristics as reference coordinates, and computing the eye iris center coordinates together with the key features of the three-dimensional spatial information of the face reference vector;
and completing eye-movement-vector gaze tracking by applying a gaze-tracking regression function to those key features (a sketch follows this claim).
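One plausible realisation of the iris step is OpenCV's circular Hough transform with tight radius bounds, which matches the "scale-fixed" idea of searching only near the expected iris size. The parameter values below, and the affine least-squares stand-in for the unspecified gaze-tracking regression function, are assumptions for illustration only.

```python
import cv2
import numpy as np

def locate_iris_center(eye_bgr, r_min=8, r_max=20):
    """Denoise an eye crop, then find the iris circle with a Hough transform
    constrained to a fixed radius range (values assumed, not from the patent)."""
    gray = cv2.cvtColor(eye_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.medianBlur(gray, 5)          # denoising before edge detection
    circles = cv2.HoughCircles(gray, cv2.HOUGH_GRADIENT, dp=1,
                               minDist=gray.shape[0],  # expect one iris per crop
                               param1=100, param2=20,
                               minRadius=r_min, maxRadius=r_max)
    if circles is None:
        return None
    x, y, _ = circles[0, 0]
    return float(x), float(y)

def fit_gaze_regression(eye_vecs, screen_pts):
    """Affine least-squares map from eye-movement vectors to screen points,
    standing in for the claim's unspecified regression function."""
    A = np.hstack([eye_vecs, np.ones((len(eye_vecs), 1))])
    W, *_ = np.linalg.lstsq(A, screen_pts, rcond=None)
    return W  # predict with: np.hstack([v, [1.0]]) @ W
```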
6. A single-camera multi-mode gaze tracking device, the device comprising:
an alignment module, configured to align face information acquired by a camera using an attention-enhanced adaptive three-dimensional face alignment method to obtain a three-dimensional face alignment result;
a calibration module, configured to calibrate face reference features and solve for a face reference vector used in the gaze tracking process;
a computing module, configured to calculate, using the three-dimensional face alignment result, normal vectors of the triangular faces formed by a plurality of key points on the face mesh;
a fitting module, configured to construct a reference projection vector with the three-dimensional face center as its starting point and fit a gaze transfer matrix, obtaining a dynamic reference vector of the freely moving head in three-dimensional space;
an establishing module, configured to calibrate user features according to the interaction habits of different users and establish specific information for the user to be tracked;
and a tracking module, configured to acquire, during continuous gaze interaction, human-eye iris data from the denoised, edge-enhanced image using the Hough transform, and to determine an eye-movement reference vector from the iris data and the face reference vector to complete gaze tracking.
7. The apparatus of claim 6, wherein the alignment module comprises:
a first sub-module, configured to determine an adaptive face detection method and acquire face information using the determined adaptive face detection method;
a second sub-module, configured to extract, for the face information acquired by the camera, multi-scale image receptive fields through multi-scale convolution kernels, and to extract face features from the face information using deformable convolution to determine the deformation-adapted face features;
and a third sub-module, configured to introduce an attention mechanism into the three-dimensional face key-point recognition performed on the face information acquired by the camera, to obtain the three-dimensional face alignment result.
8. The apparatus according to claim 7, wherein the first sub-module, when determining the adaptive face detection method, is specifically configured to:
removing irrelevant feature points from the face information acquired by the camera using two pooling operations and cascaded rectified linear units;
acquiring feature information of different structures by using an Inception feature module as the backbone network;
and performing shape adaptation: after the backbone network, extracting features of different scales through a plurality of different convolution kernels using a single-shot multi-box prediction method.
9. The apparatus of claim 7, wherein the third sub-module is specifically configured to:
extracting heatmap-based three-dimensional face features: extracting feature maps of two-dimensional key points using the first two layers of a DenseNet, and integrating the features into 69 key-point heatmaps, of which 68 channels correspond to face key points and 1 channel to the background;
and performing feature decoding using a stackable attention-based residual structure.
10. An electronic device, comprising a processor, a memory, and a program or instruction stored on the memory and executable on the processor, wherein the program or instruction, when executed by the processor, implements the steps of the single-camera multi-mode gaze tracking method according to any one of claims 1 to 5.
CN202210073470.5A 2022-01-21 2022-01-21 Single-camera multi-mode sight tracking method and device and electronic equipment Pending CN114494347A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210073470.5A CN114494347A (en) 2022-01-21 2022-01-21 Single-camera multi-mode sight tracking method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210073470.5A CN114494347A (en) 2022-01-21 2022-01-21 Single-camera multi-mode sight tracking method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN114494347A true CN114494347A (en) 2022-05-13

Family

ID=81472238

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210073470.5A Pending CN114494347A (en) 2022-01-21 2022-01-21 Single-camera multi-mode sight tracking method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN114494347A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114973391A (en) * 2022-06-30 2022-08-30 北京万里红科技有限公司 Eyeball tracking method, device and equipment applied to metacarpal space
CN114898447A (en) * 2022-07-13 2022-08-12 北京科技大学 Personalized fixation point detection method and device based on self-attention mechanism
CN114898447B (en) * 2022-07-13 2022-10-11 北京科技大学 Personalized fixation point detection method and device based on self-attention mechanism
CN115641365A (en) * 2022-08-25 2023-01-24 北京微视威信息科技有限公司 Point cloud registration method, system, device and medium
CN115641365B (en) * 2022-08-25 2023-08-04 北京微视威信息科技有限公司 Point cloud registration method, system, equipment and medium
CN116704572A (en) * 2022-12-30 2023-09-05 荣耀终端有限公司 Eye movement tracking method and device based on depth camera
CN116704572B (en) * 2022-12-30 2024-05-28 荣耀终端有限公司 Eye movement tracking method and device based on depth camera

Similar Documents

Publication Publication Date Title
CN110147721B (en) Three-dimensional face recognition method, model training method and device
CN114494347A (en) Single-camera multi-mode sight tracking method and device and electronic equipment
CN112733794B (en) Method, device and equipment for correcting sight of face image and storage medium
CN111046717A (en) Fundus image macular center positioning method and device, electronic equipment and storage medium
EP4383193A1 (en) Line-of-sight direction tracking method and apparatus
CN110363817B (en) Target pose estimation method, electronic device, and medium
CN114565655B (en) Depth estimation method and device based on pyramid segmentation attention
US9183634B2 (en) Image processing apparatus and image processing method
CN113724379B (en) Three-dimensional reconstruction method and device for fusing image and laser point cloud
CN112200056B (en) Face living body detection method and device, electronic equipment and storage medium
CN111680573B (en) Face recognition method, device, electronic equipment and storage medium
CN111898571A (en) Action recognition system and method
CN111914748A (en) Face recognition method and device, electronic equipment and computer readable storage medium
CN111353325A (en) Key point detection model training method and device
CN105488780A (en) Monocular vision ranging tracking device used for industrial production line, and tracking method thereof
CN107403448B (en) Cost function generation method and cost function generation device
CN115841602A (en) Construction method and device of three-dimensional attitude estimation data set based on multiple visual angles
CN115222621A (en) Image correction method, electronic device, storage medium, and computer program product
CN115862095A (en) Adaptive sight line estimation method, system, electronic equipment and storage medium
CN112686851B (en) Image detection method, device and storage medium
CN112528714B (en) Single-light-source-based gaze point estimation method, system, processor and equipment
CN112766065A (en) Mobile terminal examinee identity authentication method, device, terminal and storage medium
CN111241870A (en) Terminal device and face image recognition method and system thereof
CN117275069B (en) End-to-end head gesture estimation method based on learnable vector and attention mechanism
CN115984945A (en) Method and device for acquiring face key point truth value, vehicle and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination