CN116091541A - Eye movement tracking method, eye movement tracking device, electronic device, storage medium, and program product


Info

Publication number: CN116091541A
Application number: CN202211653531.1A
Authority: CN (China)
Legal status: Pending
Other languages: Chinese (zh)
Inventor: 张环宇
Current assignee: Zeku Technology Shanghai Corp Ltd
Original assignee: Zeku Technology Shanghai Corp Ltd
Application filed by: Zeku Technology Shanghai Corp Ltd
Prior art keywords: image, sample, network, eye, initial

Classifications

    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06N 3/04 Neural networks; Architecture, e.g. interconnection topology
    • G06N 3/08 Neural networks; Learning methods
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 10/82 Image or video recognition or understanding using neural networks
    • G06V 40/18 Eye characteristics, e.g. of the iris
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/30201 Subject of image: Face
    • Y02D 30/70 Reducing energy consumption in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Ophthalmology & Optometry (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to an eye movement tracking method, an eye movement tracking device, an electronic device, a storage medium and a program product. The method comprises the following steps: acquiring an image to be detected; performing eye movement tracking according to the image to be detected and a preset detection model to obtain an eye movement tracking result; the detection model is obtained by training an initial detection model according to a sample image, an enhanced image corresponding to the sample image and a teacher model. By adopting the method, eye movement tracking can be accurately performed.

Description

Eye movement tracking method, eye movement tracking device, electronic device, storage medium, and program product
Technical Field
The present application relates to the field of eye tracking technology, and in particular, to an eye movement tracking method, an eye movement tracking device, an electronic device, a storage medium, and a program product.
Background
With the development of pattern recognition and computer vision technology, eye tracking has been widely applied. Eye tracking estimates the gaze point of the human eye by measuring eye motion, tracks changes of the eye in real time, and predicts the state and needs of the user from those changes.
In the conventional technology, a trained neural network model is mainly used to learn the mapping relationship between eye features in a face image and the gaze direction, so as to track eye movement. However, conventional eye tracking methods suffer from low accuracy.
Disclosure of Invention
The embodiment of the application provides an eye movement tracking method, an eye movement tracking device, electronic equipment, a storage medium and a program product, which can accurately perform eye movement tracking.
In a first aspect, an embodiment of the present application provides an eye tracking method, including:
acquiring an image to be detected;
performing eye movement tracking according to the image to be detected and a preset detection model to obtain an eye movement tracking result; the detection model is obtained by training an initial detection model according to a sample image, an enhanced image corresponding to the sample image and a teacher model.
In a second aspect, embodiments of the present application provide an eye tracking device comprising:
the first acquisition module is used for acquiring an image to be detected;
the second acquisition module is used for carrying out eye movement tracking according to the image to be detected and a preset detection model to obtain an eye movement tracking result; the detection model is obtained by training an initial detection model according to a sample image, an enhanced image corresponding to the sample image and a teacher model.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the eye tracking method according to the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method according to the first aspect.
In a fifth aspect, embodiments of the present application provide a computer program product comprising a computer program which, when executed by a processor, implements the steps of the eye tracking method according to the first aspect.
According to the eye movement tracking method, the eye movement tracking device, the electronic device, the storage medium, and the program product, the preset detection model is obtained by training the initial detection model according to the sample image, the enhanced image corresponding to the sample image, and the teacher model. Because the sample image, its enhanced image, and the teacher model allow the initial detection model to be trained accurately, the resulting detection model has higher accuracy, and eye movement tracking performed according to the image to be detected and the preset detection model therefore yields an eye movement tracking result of higher accuracy.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present application, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a diagram of an application environment of an eye tracking method in one embodiment;
FIG. 2 is a flow chart of a method of eye tracking in one embodiment;
FIG. 3 is a flow chart of a method of eye tracking in another embodiment;
FIG. 4 is a flow chart of a method of eye tracking in another embodiment;
FIG. 5 is a flow chart of a method of eye tracking in another embodiment;
FIG. 6 is a flow chart of a method of eye tracking in another embodiment;
FIG. 7 is a flow chart of a method of eye tracking in another embodiment;
FIG. 8 is a flow chart of a method of eye tracking in another embodiment;
FIG. 9 is a flow chart of a method of eye tracking in another embodiment;
FIG. 10 is a flow chart of a method of eye tracking in another embodiment;
FIG. 11 is a schematic diagram of a training process of a detection model in one embodiment;
FIG. 12 is a block diagram of an eye tracking device in one embodiment;
FIG. 13 is a block diagram of an eye tracking device according to another embodiment;
FIG. 14 is a block diagram of an eye tracking device according to another embodiment;
FIG. 15 is a block diagram of an eye tracking device according to another embodiment;
FIG. 16 is a block diagram of an eye tracking device according to another embodiment;
FIG. 17 is a block diagram of an eye tracking device according to another embodiment;
FIG. 18 is a block diagram of an eye tracking device according to another embodiment;
FIG. 19 is a block diagram of an eye tracking device according to another embodiment;
FIG. 20 is a schematic diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
The eye tracking method provided by the embodiments of the present application can be applied to the application environment shown in fig. 1, in which the electronic device 102 communicates with the server 104 via a network. The data storage system may store data that the server 104 needs to process; it may be integrated on the server 104 or placed on a cloud or other network server. The electronic device 102 may send an acquisition request to the server 104 and acquire corresponding information from it. The electronic device 102 may be, but is not limited to, a personal computer, a notebook computer, a smart phone, a tablet computer, an internet of things device, or a portable wearable device; the internet of things device may be a smart speaker, smart television, smart air conditioner, smart vehicle device, and the like, and the portable wearable device may be a smart watch, smart bracelet, headset, or the like. The server 104 may be implemented as a stand-alone server or as a server cluster composed of multiple servers.
In one embodiment, as shown in fig. 2, an eye tracking method is provided, which is illustrated by using the method applied to the electronic device in fig. 1 as an example, and includes the following steps:
s201, acquiring an image to be detected.
The image to be detected is an image of a user using the electronic device. Optionally, the image to be detected may or may not contain a face. Optionally, it may be captured by a camera of the electronic device. Optionally, it may be a frontal image or a side image of the user, and it may be taken with the user far from or near to the electronic device. Optionally, it may be acquired while the user reads on the electronic device or watches a video on it, which is not limited herein. Optionally, it may contain one user or a plurality of users.
S202, eye movement tracking is carried out according to an image to be detected and a preset detection model, and an eye movement tracking result is obtained; the detection model is obtained by training an initial detection model according to the sample image, the enhanced image corresponding to the sample image and the teacher model.
The enhanced image corresponding to the sample image is obtained by applying transformations such as flipping and translation to the sample image. The detection model is obtained by performing weakly supervised training on the initial detection model in advance according to the sample image, the enhanced image corresponding to the sample image, and the teacher model. That is, during training, the initial detection model serves as a student model, the teacher model guides its learning, and the sample image and its corresponding enhanced image are used as training data; the trained detection model is then used, together with the image to be detected, to perform eye movement tracking and obtain an eye movement tracking result.
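As a minimal sketch of how such an enhanced image could be produced (assuming torchvision-style transforms; the text names only flipping and translation as example transformations, so the flip probability and translation range below are illustrative assumptions):

```python
from PIL import Image
import torchvision.transforms as T

# Flipping and translation, the two transformations named above; the flip
# probability and the 10% translation range are illustrative assumptions.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.RandomAffine(degrees=0, translate=(0.1, 0.1)),
])

def make_enhanced(sample_image: Image.Image) -> Image.Image:
    """Return one enhanced view of a sample image for weakly supervised training."""
    return augment(sample_image)
```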
It will be appreciated that the eye movement tracking result obtained from the image to be detected and the detection model may be applied to different eye tracking scenarios. For example, it may be applied to a gaze-controlled screen scenario, in which the screen of the electronic device is kept lit or turned off according to the eye movement tracking result; or to an automatic page-turning scenario, in which, when the automatic reading function of the electronic device is used, the screen is controlled to turn the page according to the eye movement tracking result; or to an eye-protection prompting scenario, in which the distance between the human eye and the camera of the electronic device is determined according to the eye movement tracking result and an eye-protection prompt is issued according to that distance.
In the eye movement tracking method, the preset detection model is obtained by training the initial detection model according to the sample image, the enhanced image corresponding to the sample image and the teacher model, and the initial detection model can be accurately trained through the sample image, the enhanced image corresponding to the sample image and the teacher model, so that the accuracy of the obtained detection model is higher, eye movement tracking with higher accuracy can be performed according to the image to be detected and the preset detection model, and eye movement tracking results with higher accuracy are obtained.
In the scenario in which eye movement tracking is performed according to the image to be detected and the preset detection model to obtain the eye movement tracking result, in one embodiment, the detection model includes a detection network and a tracking network; as shown in fig. 3, S202 includes:
s301, inputting the image to be detected into a detection network to obtain the key points of the human face in the image to be detected.
The face key points may represent the head pose of the user; for example, they may include the center points of the two eyeballs in the face, one point at the tip of the nose, and one point on the mandible. Optionally, in this embodiment, the image to be detected may be input into the detection network for feature extraction, and the extracted features are used to analyze the image to be detected and obtain its face key points. Optionally, before the image to be detected is input into the detection network, smoothing, filtering, and other processing may be applied to remove noise, and the processed image is then input into the detection network to obtain the face key points.
S302, acquiring an eye movement tracking result according to the face key points and the tracking network.
Optionally, in this embodiment, the face key points may be input into the tracking network to obtain the eye movement tracking result; for example, the obtained result may be a gaze point estimate of the human eye. Alternatively, the face key points may be input into the tracking network to obtain the human eye image in the image to be detected, and eye movement tracking is then performed with the obtained human eye image to obtain the eye movement tracking result.
In this embodiment, the image to be detected is input into the detection network, so that the face key points in the image to be detected can be accurately obtained, the eye movement tracking result can be accurately obtained according to the face key points and the tracking network, and the accuracy of the obtained eye movement tracking result is improved.
In this embodiment, a process of obtaining a face key point in an image to be detected will be described in detail. In one embodiment, the detection network includes a backbone network, a face recognition network, and a key point detection network; as shown in fig. 4, S301 includes:
s401, inputting the image to be detected into a backbone network to obtain the characteristics of the image to be detected.
The backbone network may be a lightweight MobileNetV2 network or a ResNet network; this embodiment does not limit the type of backbone network, and a suitable one may be selected according to the processing capability of the electronic device. Optionally, in this embodiment, feature extraction may be performed on the image to be detected by the convolution layers of the backbone network to obtain the features of the image to be detected.
S402, inputting the features of the image to be detected into a face recognition network to obtain a face image corresponding to the image to be detected.
In this embodiment, a face recognition network may be connected after the backbone network, and the features of the image to be detected extracted by the backbone network are analyzed and recognized through the face recognition network, so as to output the face image corresponding to the image to be detected. It should be noted that the face image corresponding to the image to be detected is the partial face region of the image to be detected. In addition, the face image obtained in this embodiment contains both the left and right halves of the face; if a recognized face image contains only the left half or only the right half, it may be discarded.
S403, inputting the face image into a key point detection network to obtain the face key points.
In this embodiment, the key point detection network is a pre-trained network capable of identifying key points in the face image, and the obtained face image can be input into the key point detection network, and the face image is identified through the key point detection network, so as to obtain the face key points in the face image.
In this embodiment, the image to be detected is input into the backbone network of the detection network, and its features can be accurately obtained through the backbone network; these features are then input into the face recognition network of the detection network to accurately obtain the face image corresponding to the image to be detected; the obtained face image is in turn input into the key point detection network of the detection network, which processes it accurately to obtain the face key points in the face image, as illustrated in the sketch below.
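The following sketch illustrates this three-stage pipeline (backbone, face recognition network, key point detection network), assuming PyTorch; the concrete layer configurations of the two heads are not given in the text, so `face_head` and `keypoint_head` are hypothetical stand-ins:

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

class DetectionNetwork(nn.Module):
    """Sketch of S401-S403: backbone -> face recognition -> key point detection."""
    def __init__(self, face_head: nn.Module, keypoint_head: nn.Module):
        super().__init__()
        self.backbone = mobilenet_v2().features  # lightweight backbone (S401)
        self.face_head = face_head               # face recognition network (S402)
        self.keypoint_head = keypoint_head       # key point detection network (S403)

    def forward(self, image: torch.Tensor):
        features = self.backbone(image)             # features of the image to be detected
        face_image = self.face_head(features)       # partial face region
        keypoints = self.keypoint_head(face_image)  # face key points
        return face_image, keypoints
```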
In the scenario in which the eye movement tracking result is obtained according to the face key points and the tracking network of the detection model, the obtained result may be a gaze point estimation result. In one embodiment, as shown in fig. 5, S302 includes:
s501, inputting the face key points and the face images into a recognition network to obtain the eye images corresponding to the images to be detected.
In this embodiment, the obtained face key points and the face image may be input into the recognition network of the tracking network, which crops the face image to obtain the human eye image. For example, the recognition network may take the centers of the two eyeballs among the face key points as center points and crop left and right eye patches whose side length is 1/3 of the face width in the face image; or it may crop left and right eye patches with a side length of 1/4 of the face width, obtaining the human eye image corresponding to the image to be detected.
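A minimal sketch of this cropping step, assuming eyeball-center key points given in pixel coordinates; the 1/3-face-width patch size follows the example above, and everything else (array layout, helper name) is an assumption:

```python
import numpy as np

def crop_eyes(face_image: np.ndarray, eye_centers, face_width: int,
              ratio: float = 1.0 / 3.0):
    """Cut a square patch of side ratio * face_width around each eyeball center."""
    half = int(face_width * ratio) // 2
    h, w = face_image.shape[:2]
    patches = []
    for cx, cy in eye_centers:  # [(x_left, y_left), (x_right, y_right)]
        cx, cy = int(cx), int(cy)
        x0, x1 = max(0, cx - half), min(w, cx + half)
        y0, y1 = max(0, cy - half), min(h, cy + half)
        patches.append(face_image[y0:y1, x0:x1])
    return patches  # [left_eye_patch, right_eye_patch]
```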
S502, obtaining a first gaze point estimation result according to the face key points and the eye images.
Optionally, in this embodiment, the tracking network may further include an analysis network, and the face key points and the human eye image may be input into the analysis network to obtain the first gaze point estimation result; or the face key points and the human eye image may be analyzed by a preset algorithm to obtain the first gaze point estimation result. Further, in this embodiment, when the screen display state of the electronic device is the bright-screen state, if the first gaze point estimation result indicates that the gaze point of the human eye is located in an area outside the screen of the electronic device, for example, in an area extending 1 cm beyond the screen edge, the screen of the electronic device may be controlled to turn off; that is, in the bright-screen state, the screen is kept on while the human eye gazes at it, and it is turned off when, according to the first gaze point estimation result, the human eye no longer gazes at it. Alternatively, as another optional implementation, when the screen display state of the electronic device is the off-screen state, if the first gaze point estimation result indicates that the gaze point is located on the screen of the electronic device, the screen is controlled to light up; that is, in the off-screen state, the screen is lit when, according to the first gaze point estimation result, the human eye gazes at it.
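The screen-control logic just described can be summarized in a small sketch; the two boolean inputs (whether the gaze point is on the screen and whether it falls in the outer margin, e.g. the 1 cm band around the screen) are assumed to be derived from the first gaze point estimation result:

```python
def update_screen(screen_is_on: bool, gaze_on_screen: bool,
                  gaze_in_outer_area: bool) -> bool:
    """Return the new screen state given the first gaze point estimation result."""
    if screen_is_on and gaze_in_outer_area:
        return False  # gaze left the screen area: turn the screen off
    if not screen_is_on and gaze_on_screen:
        return True   # gaze is on the screen again: light it up
    return screen_is_on
```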
In this embodiment, by inputting the face key points and the face images into the recognition network, the eye images corresponding to the images to be detected can be accurately obtained, and further the first gaze point estimation result can be accurately obtained according to the face key points and the eye images, so that the accuracy of the obtained first gaze point estimation result is ensured.
In the above scenario of obtaining the human eye image, an iris-segmented human eye image may further be derived from the human eye image and used for eye movement tracking. On the basis of the above embodiment, in one embodiment, the tracking network further includes a segmentation network and an estimation network; as shown in fig. 6, the method further includes:
s601, inputting the human eye image into a segmentation network to obtain the iris segmented human eye image.
The segmentation network may be a network composed of convolution layers and pooling layers. In this embodiment, the human eye image may be input into the segmentation network and processed through the convolution and pooling layers to obtain the iris-segmented human eye image.
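A minimal conv/pool segmentation network of the kind described, assuming a single-channel iris mask as output; the depths and widths below are illustrative only:

```python
import torch.nn as nn

# Convolution and pooling layers producing a per-pixel iris probability.
iris_segmenter = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                               # pooling layer
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Upsample(scale_factor=2),                   # back to input resolution
    nn.Conv2d(32, 1, kernel_size=1), nn.Sigmoid()  # iris mask in [0, 1]
)
```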
S602, inputting the human face key points, the human eye images and the human eye images after iris segmentation into an estimation network to obtain a second gaze point estimation result.
In this embodiment, the segmentation network may be followed by an estimation network, through which the face key points, the human eye image, and the iris-segmented human eye image are processed to obtain the second gaze point estimation result. It can be understood that because the second gaze point estimation result is obtained using the face key points, the human eye image, and the iris-segmented human eye image, and the iris-segmented image provides richer human eye information, the accuracy of the gaze point estimation is further ensured and the obtained second gaze point estimation result is more accurate.
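One way the estimation network could fuse its three inputs is sketched below, assuming each input has already been encoded into a fixed-length vector; the encoders, dimensions, and two-coordinate gaze output are hypothetical:

```python
import torch
import torch.nn as nn

class GazeEstimator(nn.Module):
    """Sketch of S602: fuse key points, eye image, and iris-segmented image."""
    def __init__(self, kp_dim: int = 8, eye_dim: int = 128, iris_dim: int = 128):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(kp_dim + eye_dim + iris_dim, 64), nn.ReLU(),
            nn.Linear(64, 2),  # (x, y) gaze point on the screen
        )

    def forward(self, kp_vec, eye_vec, iris_vec):
        return self.head(torch.cat([kp_vec, eye_vec, iris_vec], dim=-1))
```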
And S603, if the second gaze point estimation result indicates that the gaze point is located in the preset area of the screen of the electronic device, controlling the electronic device to perform page turning operation.
In this embodiment, the second gaze point estimation result obtained above may be used in the intelligent reading mode of the electronic device, that is, if the second gaze point estimation result indicates that the gaze point is located in the preset area of the screen of the electronic device, the electronic device is controlled to perform the page turning operation. Optionally, the preset area in this embodiment may be a bottom area of the electronic device, that is, when the second gaze point estimation result indicates that the gaze point is located at the bottom of the screen of the electronic device, the electronic device may be controlled to automatically turn pages, so as to implement intelligent reading.
In this embodiment, the human eye image is input into the segmentation network, which can segment it accurately to produce a high-quality iris-segmented human eye image. The face key points, the human eye image, and the iris-segmented human eye image can then be input into the estimation network to obtain the second gaze point estimation result, and when that result indicates that the gaze point is located in the preset area of the screen of the electronic device, the electronic device is controlled to perform the page turning operation. Because the obtained second gaze point estimation result is highly accurate, the accuracy of the page turning operation of the electronic device is also ensured.
After the iris-segmented human eye image is obtained, eye protection prompt can be performed according to the iris-segmented human eye image. In one embodiment, as shown in fig. 7, the method further includes:
s701, acquiring iris areas according to the iris segmented human eye images.
Optionally, in this embodiment, the area of the iris in the iris-segmented human eye image may be calculated. For example, the iris area may be obtained from the width and length of the iris region in the iris-segmented human eye image.
S702, obtaining the distance from human eyes to a camera of the electronic equipment according to the iris area.
It will be appreciated that the closer the human eye is to the camera of the electronic device, the larger the iris area, and the farther away the human eye is, the smaller the iris area. In this embodiment, the distance from the human eye to the camera of the electronic device may be obtained from this relationship and the iris area in the human eye image.
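A hedged sketch of S701-S702: the iris area is read off the segmentation mask, and the distance follows from the inverse relationship above. Under a simple pinhole-camera assumption the apparent iris diameter scales with 1/distance, so distance ~ k / sqrt(area); the calibration constant k and the use of a pinhole model are assumptions, not part of the text:

```python
import numpy as np

def iris_area(mask: np.ndarray) -> float:
    """Count iris pixels in the binary segmentation mask (S701)."""
    return float((mask > 0.5).sum())

def eye_to_camera_distance(area: float, k: float) -> float:
    """Larger iris area means a closer eye (S702); k is calibrated per camera."""
    return k / np.sqrt(area)
```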
And S703, if the distance is smaller than the preset distance threshold, controlling the electronic equipment to output eye-protection prompt information.
In this embodiment, if it is determined that the distance between the human eye and the camera of the electronic device is smaller than the preset distance threshold, it is indicated that the human eye is too close to the electronic device, and the electronic device may be controlled to output eye protection prompt information to remind the user to keep the distance from the electronic device. Optionally, in this embodiment, the electronic device may be controlled to output text eye-protection prompt information, or the electronic device may be controlled to output voice eye-protection prompt information, which is not limited herein.
In this embodiment, the iris area in the human eye image can be accurately obtained from the iris-segmented human eye image, so the distance from the human eye to the camera of the electronic device can be accurately derived from the iris area. When this distance is smaller than the preset distance threshold, the electronic device is controlled to output the eye-protection prompt information, reminding the user in a timely and accurate manner to keep a proper distance from the electronic device, so that the user's eyes are protected while the electronic device is in use.
This embodiment describes in detail the training process of the detection model used above. In one embodiment, the initial detection model includes an initial detection network and an initial tracking network; as shown in fig. 8, the method further includes:
s801, inputting the sample image into an initial detection network to obtain sample face key points in the sample image.
In this embodiment, the sample image may be input into an initial detection network, and the sample image may be subjected to processing such as feature extraction through the initial detection network, so as to obtain a sample face key point in the sample image. Optionally, the initial detection network in this embodiment may include a neural network layer such as a convolution layer, and the sample image may be processed by the neural network layer such as the convolution layer of the initial detection network, so as to obtain a sample face key point in the sample image.
S802, acquiring a first sample eye movement tracking result according to the sample face key points and the initial tracking network.
Optionally, in this embodiment, the sample face key points may be input into an initial tracking network, and the sample face key points are processed by the initial tracking network to obtain the first sample eye movement tracking result. It can be appreciated that the first sample eye tracking result obtained in this embodiment may be an estimation result of the eye gaze point of the human eye in the sample image.
S803, training the initial detection model according to the first sample eye movement tracking result, the enhanced image and the teacher model to obtain a detection model.
Optionally, in this embodiment, the initial detection model may serve as the student model and the teacher model may guide its training: the enhanced image corresponding to the sample image is input into the initial detection model to obtain the output corresponding to the enhanced image, and the initial detection model is then trained using the obtained first sample eye movement tracking result, the output corresponding to the enhanced image, and the teacher model, yielding the detection model.
In this embodiment, the sample image is input into the initial detection network, through which the sample face key points in the sample image can be obtained; the first sample eye movement tracking result can then be obtained according to the sample face key points and the initial tracking network, and the initial detection model can further be trained according to the first sample eye movement tracking result, the enhanced image corresponding to the sample image, and the teacher model to obtain the detection model.
In the above-mentioned scenario of inputting the sample image into the initial detection network to obtain the sample face key points in the sample image, in one embodiment, the initial detection network includes an initial backbone network, an initial face recognition network, and an initial key point detection network, as shown in fig. 9, the step S801 includes:
s901, inputting a sample image into an initial backbone network to obtain a first sample characteristic of the sample image.
The initial backbone network may be a lightweight MobileNetV2 network or a ResNet network; this embodiment does not limit the type of the initial backbone network, and a suitable one may be selected according to the processing capability of the electronic device. Optionally, in this embodiment, feature extraction may be performed on the sample image by the convolution layers of the initial backbone network to obtain the first sample feature of the sample image.
S902, inputting the first sample characteristic into an initial face recognition network to obtain a sample face image corresponding to the sample image.
In this embodiment, an initial face recognition network may be connected after the initial backbone network, and the first sample feature of the sample image extracted by the initial backbone network is analyzed and identified through the initial face recognition network, so as to output a sample face image corresponding to the sample image. The sample face image corresponding to the sample image is a face partial image in the sample image.
S903, inputting the sample face image into an initial key point detection network to obtain sample face key points.
In this embodiment, the obtained sample face image may be input into an initial key point detection network, and the sample face image is identified through the initial key point detection network, so as to obtain a sample face key point in the sample face image.
In this embodiment, the sample image is input into the initial backbone network of the initial detection network, through which the first sample feature of the sample image can be accurately obtained; the first sample feature is then input into the initial face recognition network of the initial detection network to accurately obtain the sample face image corresponding to the sample image; the obtained sample face image is in turn input into the initial key point detection network of the initial detection network, which processes it accurately to obtain the sample face key points in the sample face image.
In the above-mentioned scenario of training the initial detection model, the initial detection model may be trained using the obtained first sample feature, the first sample eye tracking result, the enhanced image, and the teacher model, to obtain the detection model. In one embodiment, as shown in fig. 10, S803 includes:
S1001, inputting the enhanced image into the initial backbone network to obtain a second sample characteristic of the enhanced image.
Optionally, in this embodiment, feature extraction may be performed on the enhanced image corresponding to the sample image by the convolution layers of the initial backbone network, so as to obtain the second sample feature of the enhanced image. It can be understood that if a plurality of enhanced images correspond to the sample image, second sample features are obtained for each of the enhanced images.
S1002, inputting the sample image into a teacher model to obtain a second sample eye movement tracking result.
In this embodiment, the sample image may be input into the teacher model to obtain a second sample eye movement tracking result, and training of the initial detection model is guided by the second sample eye movement tracking result.
S1003, training an initial detection model according to the first sample characteristic, the second sample characteristic, the first sample eye movement tracking result, the second sample eye movement tracking result and the gold standard eye movement tracking result corresponding to the sample image to obtain a detection model.
Optionally, in this embodiment, as shown in fig. 11, the initial backbone network in the upper branch is the multiplexed backbone network MobileNetV2. The representation output by the backbone network is converted into the representation space of the weakly supervised learning loss through a converter composed of a convolution layer, a normalization layer, a fully connected layer, and a ReLU6 layer. The output of the backbone network is also sent to the tracking network for eye movement tracking to obtain the first sample eye movement tracking result, and a supervised learning loss is calculated between the first sample eye movement tracking result and the ground-truth label; this is the second loss function for training the initial detection model. In the last branch, a teacher model is provided; that is, knowledge distillation is introduced to teach the model in the early stage of training. The teacher model is a large-scale model trained by single-task supervised learning of eye movement tracking on data sets such as GazeCapture, MPIIFaceGaze, and EYEDIAP; its structure is a deep convolutional neural network followed by a tracking network, and while its generalization and robustness are excellent, the model itself is too large for deployment. Taking the initial backbone network and the tracking network as the student model, the similarity between the output of the student model and the output of the teacher model gives the third loss function for training the initial detection model; the multi-stage teaching loss, which performs better in knowledge distillation, is composed of two parts, both of which evaluate the agreement between the final output of the teacher model and the output of the tracking network following the initial backbone network. The converter in the lower branch has the same network structure as that in the upper branch, but its parameters are a running average of the upper-branch parameters; the input of the lower-branch backbone network is an enhanced image of the sample image, and the dissimilarity between the outputs of the upper-branch and lower-branch backbone networks is to be minimized. This is the first loss function of the initial detection model. The lower branch implements another important part of weakly supervised learning: it maximizes the representation consistency of different views of the same data instance in the data representation space.
Through the above description, in this embodiment, the value of the first loss function of the initial detection model may be obtained according to the first sample feature and the second sample feature; the value of the second loss function according to the first sample eye movement tracking result and the gold standard eye movement tracking result corresponding to the sample image; and the value of the third loss function according to the second sample eye movement tracking result and the gold standard eye movement tracking result. The initial detection model is then trained with the values of the three loss functions to obtain the detection model. Optionally, in this embodiment, the initial detection model may be trained with a weighted sum of the values of the first, second, and third loss functions to obtain the detection model.
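A sketch of the combined objective, assuming cosine similarity for the representation-consistency term and mean squared error elsewhere; the weights are hypothetical hyper-parameters. Note that the text words the third term both as a student-teacher similarity and as a comparison against the gold standard; the sketch uses the student-teacher form, which is the one that propagates a gradient to the student:

```python
import torch
import torch.nn.functional as F

def total_loss(feat_sample, feat_enhanced,        # first / second sample features
               pred_student, pred_teacher, gold,  # eye tracking outputs and label
               w1=1.0, w2=1.0, w3=1.0):
    # First loss: representation consistency between the two views.
    loss1 = 1.0 - F.cosine_similarity(feat_sample, feat_enhanced, dim=-1).mean()
    # Second loss: supervised learning against the gold standard.
    loss2 = F.mse_loss(pred_student, gold)
    # Third loss: knowledge distillation, aligning student and teacher outputs.
    loss3 = F.mse_loss(pred_student, pred_teacher)
    return w1 * loss1 + w2 * loss2 + w3 * loss3
```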
Further, as an optional implementation, after the initial detection model is trained, the obtained detection model may be quantized to reduce its network scale, and the quantized detection model is deployed on the electronic device. As an exemplary optional implementation, the obtained detection model may be quantized into an 8-bit detection model and then deployed on the electronic device.
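As one possible way to reach an 8-bit model (an assumption: PyTorch post-training dynamic quantization is only one of several options):

```python
import torch

def quantize_for_deployment(detection_model: torch.nn.Module) -> torch.nn.Module:
    """Quantize the trained detection model's linear layers to int8."""
    return torch.quantization.quantize_dynamic(
        detection_model, {torch.nn.Linear}, dtype=torch.qint8
    )
```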
In this embodiment, by inputting the enhanced image corresponding to the sample image into the initial backbone network, the second sample feature of the enhanced image can be obtained, and by inputting the sample image into the teacher model, the second sample eye movement tracking result can be obtained. The initial detection model can therefore be accurately trained using the obtained first sample feature, second sample feature, first sample eye movement tracking result, second sample eye movement tracking result, and gold standard eye movement tracking result corresponding to the sample image, yielding a detection model of higher accuracy.
For ease of understanding by those skilled in the art, the eye tracking method provided by the present application is described in detail below, and may include:
s1, inputting a sample image into an initial backbone network of an initial detection network to obtain a first sample characteristic of the sample image.
S2, inputting the first sample characteristics into an initial face recognition network of an initial detection network to obtain a sample face image corresponding to the sample image.
And S3, inputting the sample face image into an initial key point detection network of an initial detection network to obtain sample face key points.
S4, acquiring a first sample eye movement tracking result according to the sample face key points and an initial tracking network of the initial detection network.
S5, inputting the enhanced image into an initial backbone network to obtain a second sample characteristic of the enhanced image.
S6, inputting the sample image into a teacher model to obtain a second sample eye movement tracking result.
S7, acquiring a value of a first loss function according to the first sample characteristic and the second sample characteristic.
S8, obtaining a value of the second loss function according to the first sample eye movement tracking result and the gold standard eye movement tracking result.
S9, obtaining a value of a third loss function according to the second sample eye movement tracking result and the gold standard eye movement tracking result.
And S10, training the initial detection model according to the value of the first loss function, the value of the second loss function and the value of the third loss function to obtain a detection model.
S11, acquiring an image to be detected.
S12, inputting the image to be detected into a backbone network of the detection model to obtain the characteristics of the image to be detected.
S13, inputting the features of the image to be detected into a face recognition network of the detection model to obtain a face image corresponding to the image to be detected.
S14, inputting the face image into a key point detection network of the detection model to obtain the face key points.
S15, inputting the face key points and the face images into a recognition network of the detection model to obtain the eye images corresponding to the images to be detected.
S16, obtaining a first gaze point estimation result according to the face key points and the eye images.
S17, if the screen display state of the electronic device is the bright-screen state and the first gaze point estimation result indicates that the gaze point is located in an area outside the screen of the electronic device, controlling the screen of the electronic device to turn off; or,
if the screen display state of the electronic device is the off-screen state and the first gaze point estimation result indicates that the gaze point is located on the screen of the electronic device, controlling the screen of the electronic device to light up.
S18, inputting the human eye image into a segmentation network of the detection model to obtain the iris segmented human eye image.
And S19, inputting the human face key points, the human eye images and the human eye images after iris segmentation into an estimation network of the detection model to obtain a second gaze point estimation result.
And S20, if the second gaze point estimation result shows that the gaze point is located in the preset area of the screen of the electronic device, controlling the electronic device to perform page turning operation.
S21, acquiring iris areas according to the iris segmented human eye images.
S22, obtaining the distance from the human eyes to the camera of the electronic equipment according to the iris area.
S23, if the distance from the eyes to the camera of the electronic equipment is smaller than a preset distance threshold, controlling the electronic equipment to output eye-protection prompt information.
It should be noted that, for the description of the above steps, reference may be made to the description related to the above embodiments, and the effects thereof are similar, which is not repeated herein.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in these flowcharts may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turns or alternately with at least some of the other steps or sub-steps.
Based on the same inventive concept, the embodiment of the application also provides an eye tracking device for realizing the eye tracking method. The implementation of the solution provided by the device is similar to that described in the above method, so specific limitations in one or more embodiments of the eye tracking device provided below may be found in the above limitations of the eye tracking method, and will not be described in detail herein.
In one embodiment, as shown in fig. 12, there is provided an eye tracking apparatus comprising: a first acquisition module 10 and a second acquisition module 11, wherein:
a first acquisition module 10, configured to acquire an image to be detected.
The second acquisition module 11 is configured to perform eye tracking according to the image to be detected and a preset detection model, so as to obtain an eye tracking result; the detection model is obtained by training an initial detection model according to the sample image, the enhanced image corresponding to the sample image and the teacher model.
The eye tracking device provided in this embodiment may perform the above method embodiment, and its implementation principle and technical effects are similar, and will not be described herein.
On the basis of the above embodiment, as shown in fig. 13, optionally, the above detection model includes a detection network and a tracking network; the second acquisition module 11 includes: a first acquisition unit 111 and a second acquisition unit 112, wherein:
the first obtaining unit 111 is configured to input an image to be detected into the detection network, and obtain a face key point in the image to be detected.
The second obtaining unit 112 is configured to obtain an eye tracking result according to the face key point and the tracking network.
The eye tracking device provided in this embodiment may perform the above method embodiment, and its implementation principle and technical effects are similar, and will not be described herein.
On the basis of the above embodiment, optionally, the detection network includes a backbone network, a face recognition network, and a key point detection network; the first obtaining unit 111 is configured to input an image to be detected into a backbone network to obtain characteristics of the image to be detected; inputting the characteristics of the image to be detected into a face recognition network to obtain a face image corresponding to the image to be detected; and inputting the face image into a key point detection network to obtain the face key points.
The eye tracking device provided in this embodiment may perform the above method embodiment, and its implementation principle and technical effects are similar, and will not be described herein.
On the basis of the above embodiment, optionally, the tracking network includes a recognition network; the second obtaining unit 112 is configured to input the face key points and the face image into the recognition network to obtain the human eye image corresponding to the image to be detected, and to obtain the first gaze point estimation result according to the face key points and the human eye image.
The eye tracking device provided in this embodiment may perform the above method embodiment, and its implementation principle and technical effects are similar, and will not be described herein.
On the basis of the above embodiment, as shown in fig. 14, optionally, the above apparatus further includes: a first control module 12 and a second control module 13, wherein:
The first control module 12 is configured to control the screen of the electronic device to turn off if, while the screen display state of the electronic device is the bright-screen state, the first gaze point estimation result indicates that the gaze point is located in an area outside the screen of the electronic device.
The second control module 13 is configured to control the screen of the electronic device to light up if, while the screen display state of the electronic device is the off-screen state, the first gaze point estimation result indicates that the gaze point is located on the screen of the electronic device.
The eye tracking device provided in this embodiment may perform the above method embodiment, and its implementation principle and technical effects are similar, and will not be described herein.
On the basis of the above embodiment, as shown in fig. 15, optionally, the tracking network further includes a segmentation network and an estimation network, and the apparatus further includes: a segmentation module 14, an estimation module 15 and a third control module 16, wherein:
the segmentation module 14 is configured to input the human eye image into a segmentation network to obtain a human eye image after iris segmentation.
The estimation module 15 is configured to input the face key point, the eye image, and the iris-segmented eye image into an estimation network, to obtain a second gaze point estimation result.
And the third control module 16 is configured to control the electronic device to perform a page turning operation if the second gaze point estimation result indicates that the gaze point is located in a preset area of the screen of the electronic device.
The eye tracking device provided in this embodiment may perform the above method embodiment, and its implementation principle and technical effects are similar, and will not be described herein.
On the basis of the above embodiment, as shown in fig. 16, optionally, the above apparatus further includes: a third acquisition module 17, a fourth acquisition module 18 and a fourth control module 19, wherein:
and a third acquisition module 17, configured to acquire an iris area according to the iris segmented human eye image.
A fourth obtaining module 18, configured to obtain a distance from the human eye to the camera of the electronic device according to the iris area.
And the fourth control module 19 is configured to control the electronic device to output the eye-protection prompt message if the distance is less than the preset distance threshold.
The eye tracking device provided in this embodiment may perform the above method embodiment, and its implementation principle and technical effects are similar, and will not be described herein.
On the basis of the above embodiment, as shown in fig. 17, optionally, the initial detection model includes an initial detection network and an initial tracking network, and the apparatus further includes: a fifth acquisition module 20, a sixth acquisition module 21, and a training module 22, wherein:
The fifth acquisition module 20 is configured to input the sample image into the initial detection network to obtain sample face key points in the sample image.
The sixth acquisition module 21 is configured to obtain a first sample eye movement tracking result according to the sample face key points and the initial tracking network.
The training module 22 is configured to train the initial detection model according to the first sample eye movement tracking result, the enhanced image, and the teacher model to obtain the detection model.
The eye tracking device provided in this embodiment may perform the above method embodiment, and its implementation principle and technical effects are similar, and will not be described herein.
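A compact sketch of this student-side forward pass follows; both networks are treated as opaque callables whose inputs and outputs match the description above, and the exact argument list of the initial tracking network is an assumption.

def first_sample_tracking_result(sample_image, initial_detection_net, initial_tracking_net):
    # Student forward pass: detect sample face key points, then feed them
    # (together with the sample image) to the initial tracking network.
    sample_keypoints = initial_detection_net(sample_image)
    return initial_tracking_net(sample_image, sample_keypoints)  # first sample eye movement tracking result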
On the basis of the above embodiment, as shown in fig. 18, optionally, the initial detection network includes an initial backbone network, an initial face recognition network, and an initial key point detection network; the fifth acquisition module 20 includes: a third obtaining unit 201, a fourth obtaining unit 202, and a fifth obtaining unit 203, wherein:
The third obtaining unit 201 is configured to input the sample image into the initial backbone network to obtain a first sample feature of the sample image.
The fourth obtaining unit 202 is configured to input the first sample feature into the initial face recognition network to obtain a sample face image corresponding to the sample image.
The fifth obtaining unit 203 is configured to input the sample face image into the initial key point detection network to obtain the sample face key points.
The eye tracking device provided in this embodiment may perform the above method embodiment, and its implementation principle and technical effects are similar, and will not be described herein.
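Purely as an illustration of this composition, the sketch below chains the three sub-networks; the sub-network architectures themselves are left abstract because the publication does not fix them.

import torch
import torch.nn as nn

class InitialDetectionNetworkSketch(nn.Module):
    # Illustrative composition: backbone -> face recognition -> key point detection.
    def __init__(self, backbone: nn.Module, face_net: nn.Module, keypoint_net: nn.Module):
        super().__init__()
        self.backbone = backbone
        self.face_net = face_net
        self.keypoint_net = keypoint_net

    def forward(self, sample_image: torch.Tensor):
        first_sample_feature = self.backbone(sample_image)
        sample_face_image = self.face_net(first_sample_feature)
        sample_keypoints = self.keypoint_net(sample_face_image)
        return first_sample_feature, sample_face_image, sample_keypoints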
On the basis of the above embodiment, as shown in fig. 19, optionally, the training module 22 includes: a sixth obtaining unit 221, a seventh obtaining unit 222, and a training unit 223, wherein:
The sixth obtaining unit 221 is configured to input the enhanced image into the initial backbone network to obtain a second sample feature of the enhanced image.
The seventh obtaining unit 222 is configured to input the sample image into the teacher model to obtain a second sample eye movement tracking result.
The training unit 223 is configured to train the initial detection model according to the first sample feature, the second sample feature, the first sample eye movement tracking result, the second sample eye movement tracking result, and the gold standard eye movement tracking result corresponding to the sample image, so as to obtain the detection model.
The eye tracking device provided in this embodiment may perform the above method embodiment, and its implementation principle and technical effects are similar, and will not be described herein.
On the basis of the above embodiment, optionally, the training unit 223 is configured to obtain a value of a first loss function according to the first sample feature and the second sample feature; obtain a value of a second loss function according to the first sample eye movement tracking result and the gold standard eye movement tracking result; obtain a value of a third loss function according to the second sample eye movement tracking result and the gold standard eye movement tracking result; and train the initial detection model according to the values of the first, second, and third loss functions to obtain the detection model.
The eye movement tracking device provided in this embodiment may perform the above method embodiment, and its implementation principle and technical effects are similar, and will not be described herein.
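The publication fixes only which quantities each loss compares, not the loss forms or their weights; the mean-squared-error choices and the unweighted sum below are therefore assumptions, sketched for illustration.

import torch
import torch.nn.functional as F

def distillation_loss(first_sample_feature: torch.Tensor,
                      second_sample_feature: torch.Tensor,
                      first_sample_result: torch.Tensor,
                      second_sample_result: torch.Tensor,
                      gold_standard_result: torch.Tensor) -> torch.Tensor:
    loss1 = F.mse_loss(first_sample_feature, second_sample_feature)  # sample vs. enhanced-image features
    loss2 = F.mse_loss(first_sample_result, gold_standard_result)    # student result vs. gold standard
    loss3 = F.mse_loss(second_sample_result, gold_standard_result)   # teacher result vs. gold standard
    return loss1 + loss2 + loss3

Note that if the teacher model is frozen, the third term carries no gradient for the student; the publication does not specify how the three values are balanced, so the equal weighting here is an assumption.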
The various modules in the eye movement tracking device described above may be implemented in whole or in part by software, hardware, or a combination thereof. Each of the above modules may be embedded in hardware in, or independent of, a processor in the computer device, or may be stored in software in a memory of the computer device, so that the processor can call and execute the operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be an electronic device whose internal structure may be as shown in fig. 20. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input device. The processor, the memory, and the input/output interface are connected through a system bus, and the communication interface, the display unit, and the input device are connected to the system bus through the input/output interface. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The input/output interface of the computer device is used to exchange information between the processor and external devices. The communication interface of the computer device is used for wired or wireless communication with an external terminal; the wireless communication may be implemented through Wi-Fi, a mobile cellular network, near-field communication (NFC), or other technologies. The computer program, when executed by the processor, implements an eye movement tracking method. The display unit of the computer device is used to form a visual picture and may be a display screen, a projection device, or a virtual reality imaging device. The display screen may be a liquid crystal display screen or an electronic ink display screen. The input device of the computer device may be a touch layer covering the display screen, a key, a trackball, or a touchpad provided on the housing of the computer device, or an external keyboard, touchpad, or mouse.
It will be appreciated by those skilled in the art that the structure shown in fig. 20 is merely a block diagram of a portion of the structure associated with the present application and does not limit the computer device to which the present application is applied; a particular computer device may include more or fewer components than shown, combine some of the components, or have a different arrangement of components.
Embodiments of the present application also provide a computer-readable storage medium, in particular one or more non-transitory computer-readable storage media containing computer-executable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the eye movement tracking method.
Embodiments of the present application also provide a computer program product containing instructions that, when run on a computer, cause the computer to perform the eye movement tracking method.
It should be noted that, the user information (including, but not limited to, user equipment information, user personal information, etc.) and the data (including, but not limited to, data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data are required to comply with the related laws and regulations and standards of the related countries and regions.
Those skilled in the art will appreciate that all or part of the above methods may be implemented by a computer program stored on a non-transitory computer-readable storage medium; when executed, the program may include the steps of the above method embodiments. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM may take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of relational and non-relational databases. Non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be, without limitation, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, quantum-computing-based data processing logic units, and the like.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, any combination of these technical features that contains no contradiction should be considered within the scope of this description.
The above examples represent only a few embodiments of the present application and are described in detail, but they are not to be construed as limiting the scope of the present application. It should be noted that those of ordinary skill in the art may make various modifications and improvements without departing from the spirit of the present application, and these all fall within the protection scope of the present application. Accordingly, the protection scope of the present application shall be subject to the appended claims.

Claims (15)

1. An eye tracking method, the method comprising:
acquiring an image to be detected;
performing eye movement tracking according to the image to be detected and a preset detection model to obtain an eye movement tracking result; the detection model is obtained by training an initial detection model according to a sample image, an enhanced image corresponding to the sample image and a teacher model.
2. The method of claim 1, wherein the detection model comprises a detection network and a tracking network; performing eye movement tracking according to the image to be detected and a preset detection model to obtain an eye movement tracking result, including:
inputting the image to be detected into the detection network to obtain face key points in the image to be detected;
and acquiring the eye movement tracking result according to the face key points and the tracking network.
3. The method of claim 2, wherein the detection network comprises a backbone network, a face recognition network, and a keypoint detection network; and the inputting the image to be detected into the detection network to obtain the face key points in the image to be detected comprises the following steps:
inputting the image to be detected into the backbone network to obtain the characteristics of the image to be detected;
inputting the characteristics of the image to be detected into the face recognition network to obtain a face image corresponding to the image to be detected;
and inputting the face image into the key point detection network to obtain the face key points.
4. The method according to claim 3, wherein the tracking network comprises a recognition network; and the step of obtaining the eye movement tracking result according to the face key points and the tracking network comprises the following steps:
inputting the face key points and the face image into the recognition network to obtain a human eye image corresponding to the image to be detected;
and obtaining a first gaze point estimation result according to the face key points and the human eye image.
5. The method according to claim 4, wherein the method further comprises:
when the screen display state of the electronic device is a bright-screen state, if the first gaze point estimation result indicates that the gaze point is located in an area outside the screen of the electronic device, controlling the screen of the electronic device to turn off; or
when the screen display state of the electronic device is a screen-off state, if the first gaze point estimation result indicates that the gaze point is located on the screen of the electronic device, controlling the screen of the electronic device to turn on.
6. The method of claim 4, wherein the tracking network further comprises a segmentation network and an estimation network, the method further comprising:
inputting the human eye image into the segmentation network to obtain an iris-segmented human eye image;
inputting the face key points, the human eye image, and the iris-segmented human eye image into the estimation network to obtain a second gaze point estimation result; and
if the second gaze point estimation result indicates that the gaze point is located in a preset area of the screen of the electronic device, controlling the electronic device to perform a page-turning operation.
7. The method of claim 6, wherein the method further comprises:
acquiring an iris area according to the iris-segmented human eye image;
obtaining the distance from the human eye to a camera of the electronic device according to the iris area; and
if the distance is less than a preset distance threshold, controlling the electronic device to output eye-protection prompt information.
8. The method of any of claims 1-7, wherein the initial detection model comprises an initial detection network and an initial tracking network, the method further comprising:
inputting the sample image into the initial detection network to obtain sample face key points in the sample image;
acquiring a first sample eye movement tracking result according to the sample face key points and the initial tracking network;
and training the initial detection model according to the first sample eye movement tracking result, the enhanced image and the teacher model to obtain the detection model.
9. The method of claim 8, wherein the initial detection network comprises an initial backbone network, an initial face recognition network, and an initial keypoint detection network; and the inputting the sample image into the initial detection network to obtain the sample face key points in the sample image comprises the following steps:
inputting the sample image into the initial backbone network to obtain a first sample feature of the sample image;
inputting the first sample feature into the initial face recognition network to obtain a sample face image corresponding to the sample image;
and inputting the sample face image into the initial key point detection network to obtain the sample face key point.
10. The method of claim 9, wherein training the initial detection model based on the first sample eye tracking result, the enhanced image, and the teacher model to obtain the detection model comprises:
inputting the enhanced image into the initial backbone network to obtain a second sample feature of the enhanced image;
inputting the sample image into the teacher model to obtain a second sample eye movement tracking result;
and training the initial detection model according to the first sample feature, the second sample feature, the first sample eye movement tracking result, the second sample eye movement tracking result, and the gold standard eye movement tracking result corresponding to the sample image to obtain the detection model.
11. The method of claim 10, wherein the training the initial detection model according to the first sample feature, the second sample feature, the first sample eye-tracking result, the second sample eye-tracking result, and the gold standard eye-tracking result corresponding to the sample image to obtain the detection model comprises:
acquiring a value of a first loss function according to the first sample characteristic and the second sample characteristic;
acquiring a value of a second loss function according to the first sample eye movement tracking result and the gold standard eye movement tracking result;
acquiring a value of a third loss function according to the second sample eye movement tracking result and the gold standard eye movement tracking result;
and training the initial detection model according to the value of the first loss function, the value of the second loss function and the value of the third loss function to obtain the detection model.
12. An eye tracking device, comprising:
the first acquisition module is used for acquiring an image to be detected;
the second acquisition module is used for carrying out eye movement tracking according to the image to be detected and a preset detection model to obtain an eye movement tracking result; the detection model is obtained by training an initial detection model according to a sample image, an enhanced image corresponding to the sample image and a teacher model.
13. An electronic device comprising a memory and a processor, the memory having stored therein a computer program which, when executed by the processor, causes the processor to perform the steps of the eye tracking method of any of claims 1 to 11.
14. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 11.
15. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any one of claims 1 to 11.
CN202211653531.1A 2022-12-21 2022-12-21 Eye movement tracking method, eye movement tracking device, electronic device, storage medium, and program product Pending CN116091541A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211653531.1A CN116091541A (en) 2022-12-21 2022-12-21 Eye movement tracking method, eye movement tracking device, electronic device, storage medium, and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211653531.1A CN116091541A (en) 2022-12-21 2022-12-21 Eye movement tracking method, eye movement tracking device, electronic device, storage medium, and program product

Publications (1)

Publication Number Publication Date
CN116091541A true CN116091541A (en) 2023-05-09

Family

ID=86201729

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211653531.1A Pending CN116091541A (en) 2022-12-21 2022-12-21 Eye movement tracking method, eye movement tracking device, electronic device, storage medium, and program product

Country Status (1)

Country Link
CN (1) CN116091541A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118154627A (en) * 2024-05-09 2024-06-07 东南大学 Heart super image domain adaptive segmentation method based on eye movement and attention drive


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination