CN116704589B - Gaze point estimation method, electronic device and computer readable storage medium - Google Patents

Gaze point estimation method, electronic device and computer readable storage medium

Info

Publication number
CN116704589B
Authority
CN
China
Prior art keywords
user
image
preset
information
gaze
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211531249.6A
Other languages
Chinese (zh)
Other versions
CN116704589A (en)
Inventor
孙贻宝
周俊伟
舒畅
彭金平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Honor Device Co Ltd
Original Assignee
Honor Device Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Honor Device Co Ltd filed Critical Honor Device Co Ltd
Priority to CN202211531249.6A priority Critical patent/CN116704589B/en
Publication of CN116704589A publication Critical patent/CN116704589A/en
Application granted granted Critical
Publication of CN116704589B publication Critical patent/CN116704589B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18Eye characteristics, e.g. of the iris
    • G06V40/193Preprocessing; Feature extraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Ophthalmology & Optometry (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application provides a gaze point estimation method, an electronic device and a computer readable storage medium, and relates to the field of computer technologies. With this solution, even if the electronic device shoots in a scene that is too dark or too bright, two-dimensional and three-dimensional information of the user's eye image and eye key points can still be acquired, and the user's gaze information on the display screen can be accurately determined. The method includes the following steps: acquiring an infrared image and a depth image of a user through a preset camera; identifying the infrared image to obtain a first eye region image of the user and first position information of the user's eye key points in the first eye region image; obtaining, based on the first eye region image, a first gaze feature indicating gaze-related characteristics of the user's eyes; obtaining, from the depth image based on the first position information, a second gaze feature indicating three-dimensional features of the user's eye key points; and determining the user's gaze information on the display screen according to the first gaze feature and the second gaze feature.

Description

Gaze point estimation method, electronic device and computer readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a gaze point estimation method, an electronic device, and a computer readable storage medium.
Background
The gaze point refers to the point on a target object to which the user's line of sight is directed during visual perception. Currently, electronic devices may assist users in their work. In this process, the electronic device can determine the user's gaze point according to the user's activity and assist the user accordingly. For example, in a scenario where a user reads an electronic book, the mobile phone may determine the user's gaze point, determine whether the user has an intention to turn the page according to the gaze point, and if so, actively perform a page turning operation for the user. As another example, in a web browsing scenario, the mobile phone can determine the user's gaze point, determine the content the user is interested in according to the gaze point, and actively recommend that content to the user.
In the process of the electronic device assisting the user, the user needs to face the electronic device so that the electronic device can acquire the geometric features of the user's eyes (i.e., eye key point information such as the eye contour, the opening and closing degree of the eye corners, the pupil direction, and the like) and predict the user's gaze point from these features. However, in practical applications, in some special shooting environments, such as scenes that are too dark or too bright, the electronic device cannot acquire the geometric features of the user's eyes, so it cannot accurately determine the user's gaze point; the electronic device then fails to assist the user, and the reliability of this assistance is reduced.
Disclosure of Invention
The embodiment of the application provides a gaze point estimation method, electronic equipment and a computer readable storage medium, which are used for solving the problem that the electronic equipment cannot accurately determine the gaze point of a user in the prior art. Through the scheme, the fixation point of the user can be accurately determined.
In order to achieve the above purpose, the embodiment of the present application adopts the following technical scheme:
In a first aspect, a gaze point estimation method is provided. The method is applied to an electronic device that includes a preset camera and a display screen, and the method includes: acquiring an infrared image and a depth image of a user through the preset camera; identifying the infrared image to obtain a first eye region image of the user and first position information of the user's eye key points in the first eye region image; obtaining a first gaze feature based on the first eye region image, where the first gaze feature is used to indicate two-dimensional features of the user's eyes; obtaining a second gaze feature from the depth image based on the first position information, where the second gaze feature is used to indicate three-dimensional features of the user's eye key points; and determining the user's gaze information on the display screen according to the first gaze feature and the second gaze feature.
In this solution, even in a scene that is too dark or too bright, the infrared (Infrared Radiation, IR) image captured by the electronic device can contain a clear image of the user's face, from which an eye region image of the user (hereinafter, the first eye region image) and the position information of the user's eye key points in that eye region image can be extracted. Since the content of the user's IR image and depth image is substantially the same and the same content lies at substantially the same positions in both images, the three-dimensional features indicating the user's eye key points (i.e., the second gaze feature hereinafter) can be obtained from the depth image using the position information of the eye key points obtained above (i.e., the first position information hereinafter). Moreover, the electronic device can also extract two-dimensional feature information indicating the user's eyes (i.e., the first gaze feature hereinafter) from the eye region image of the IR image. In this way, the electronic device can accurately determine the user's gaze information on the display screen based on these two gaze features. In a scenario where the electronic device assists the user according to the user's gaze point, the electronic device can successfully assist the user, and the reliability of this assistance is improved.
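To summarize the data flow of this solution, the sketch below walks through the four steps with trivial stand-in helpers (a fixed eye-region crop, depth lookup at the key points, and a dummy fusion); these stand-ins are assumptions made only so the example runs, not the detection or feature-extraction algorithms of the patent.

```python
import numpy as np

def detect_eye_region(ir):
    """Stand-in for step 1: eye region crop + eye key point pixel positions."""
    h, w = ir.shape
    eye_region = ir[h // 3 : h // 2, w // 4 : 3 * w // 4]                # fixed crop
    keypoints_2d = [(w // 3, int(h * 0.4)), (2 * w // 3, int(h * 0.4))]  # (u, v) pairs
    return eye_region, keypoints_2d

def first_gaze_feature(eye_region):
    """Stand-in for step 2: a 2D, gaze-related feature of the eye region."""
    return eye_region.astype(np.float32).mean(axis=0)

def second_gaze_feature(depth, keypoints_2d):
    """Stand-in for step 3: 3D features sampled from the aligned depth image."""
    return np.array([depth[v, u] for u, v in keypoints_2d])

def gaze_information(f1, f2):
    """Stand-in for step 4: fuse both features into gaze information."""
    return {"gaze_point": (float(f1.mean()), float(f2.mean()))}

ir = np.random.randint(0, 255, (480, 640), dtype=np.uint8)    # IR frame
depth = np.full((480, 640), 0.35, dtype=np.float32)           # aligned depth map
region, kps = detect_eye_region(ir)
print(gaze_information(first_gaze_feature(region), second_gaze_feature(depth, kps)))
```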
In a possible implementation manner of the first aspect, determining the user's gaze information on the display screen according to the first gaze feature and the second gaze feature includes: acquiring a first eye pose parameter according to the first gaze feature; acquiring a second eye pose parameter according to the second gaze feature, where the first eye pose parameter and the second eye pose parameter include line-of-sight orientation information of the user's eyes, and the line-of-sight orientation information includes a rotation angle, a pitch angle and a heading angle of the user's eyes in a preset three-dimensional coordinate system; and determining the user's gaze information on the display screen based on the first eye pose parameter and the second eye pose parameter.
The first gaze feature is a two-dimensional feature indicating gaze-related characteristics of the user's eyes, and the second gaze feature is a three-dimensional feature indicating the user's eye key points. From these two-dimensional and three-dimensional features, the orientation of the user's eyes can be predicted. Therefore, the electronic device can first determine the orientation of the user's eyes according to the first gaze feature and the second gaze feature, and then determine the user's gaze information on the display screen based on that orientation.
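As an illustration of how a line-of-sight orientation can be turned into on-screen gaze information, the following sketch intersects a gaze ray (built from an eye position and pitch/yaw angles in the camera coordinate system) with an assumed screen plane. The angle convention, the assumption that the display lies roughly in the camera's z = 0 plane, and the pixel pitch are all example assumptions, not values given in the patent.

```python
import numpy as np

def gaze_point_on_screen(eye_xyz, pitch, yaw,
                         px_per_m=4000.0, screen_origin=(0.0, 0.0)):
    """Intersect a gaze ray with the screen plane (assumed z = 0 in the
    preset three-dimensional camera coordinate system)."""
    # Gaze direction from pitch/yaw in radians; convention chosen for the example.
    d = np.array([np.cos(pitch) * np.sin(yaw),
                  np.sin(pitch),
                  -np.cos(pitch) * np.cos(yaw)])   # looking back toward the screen
    eye = np.asarray(eye_xyz, dtype=float)
    if abs(d[2]) < 1e-6:
        return None                                # ray parallel to the screen
    t = -eye[2] / d[2]                             # solve eye_z + t * d_z = 0
    if t < 0:
        return None                                # screen lies behind the gaze ray
    hit = eye + t * d                              # intersection point in metres
    # Convert metres to display pixels (assumed pixel pitch and origin offset).
    u = (hit[0] - screen_origin[0]) * px_per_m
    v = (hit[1] - screen_origin[1]) * px_per_m
    return u, v

# Example: eye 30 cm in front of the camera, looking slightly down and to the right.
print(gaze_point_on_screen(eye_xyz=(0.00, 0.05, 0.30),
                           pitch=np.deg2rad(-10), yaw=np.deg2rad(5)))
```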
In a possible implementation manner of the first aspect, the preset three-dimensional coordinate system is a three-dimensional coordinate system with an optical center of the preset camera as an origin.
In a possible implementation manner of the first aspect, a preset correspondence exists between the first eye pose parameter and the second eye pose parameter and gaze information of the user on the display screen, and the determining, based on the first eye pose parameter and the second eye pose parameter, the gaze information of the user on the display screen includes: and determining the gazing information of the user on the display screen from a preset corresponding relation by utilizing the first human eye pose parameter and the second human eye pose parameter.
In a possible implementation manner of the first aspect, before identifying the infrared image and obtaining the first gaze feature, the method further includes: it is determined that the infrared image includes a human eye image.
In some cases, the user's head may not be stationary for a period of time, but may turn, lift, lower the head, etc. In this way, there may be no eye image of the user in the IR image and depth image of the user captured by the electronic device. If the IR image and the depth image of the user shot by the electronic device do not have the eye image of the user, the electronic device consumes a certain amount of energy and time to execute the step. In order to save the energy and time consumed by the electronic device to perform this step, the electronic device may first determine whether the user's eye image is present in the IR image, and if so, then identify the IR image to obtain the first gaze feature.
In a possible implementation manner of the first aspect, acquiring, by a preset camera, an infrared image and a depth image of a user includes: receiving a first operation, wherein the first operation is used for triggering the electronic equipment to start a preset application; and responding to the first operation, and periodically acquiring an infrared image and a depth image of the user through a preset camera after starting the preset application. That is, some applications can be set in the electronic device, and the application can support the gaze point estimation method of the present application, so that the electronic device can periodically collect the infrared image and the depth image of the user through the preset camera after starting the application.
In a possible implementation manner of the first aspect, acquiring, by a preset camera, an infrared image and a depth image of a user includes: if the electronic device is in a preset cooperative working mode, periodically acquiring an infrared image and a depth image of the user through the preset camera. In this scheme, the cooperative working mode may refer to a state in which the electronic device can determine the user's gaze point, determine the user's intention according to the gaze point, and execute a corresponding operation based on that intention. For example, in a scenario where a user reads an electronic book, the mobile phone may determine the user's gaze point, determine whether the user has an intention to turn the page according to the gaze point, and if so, actively perform a page turning operation for the user. If the electronic device is in the cooperative working mode, it can also periodically collect the infrared image and depth image of the user through the preset camera, so that it can more accurately detect the user's gaze information on the display screen.
In a possible implementation manner of the first aspect, before the capturing, by the preset camera, the infrared image and the depth image of the user, the method further includes: responding to the starting operation of a user on a preset cooperative switch, and entering a preset cooperative work mode; the preset cooperative switch is configured in a setting interface of the electronic equipment, or is configured in a control center of the electronic equipment, or is configured in a preset application of the electronic equipment.
Some applications in the electronic device may require that the cooperative working mode be enabled; the electronic device periodically collects the infrared image and depth image of the user through the preset camera only when both conditions hold, i.e., such an application has been started and the cooperative working mode is on. The preset cooperative switch may be configured in a setting interface of the electronic device, in a control center of the electronic device, or in a preset application of the electronic device, and the electronic device may enter the preset cooperative working mode in response to the user's operation of turning on any one of these preset cooperative switches. It should be noted that the electronic device may enable the cooperative working mode either before or after the application is started.
In a possible implementation manner of the first aspect, the preset camera is a TOF camera or a 3D structured light camera.
In a possible implementation manner of the first aspect, the capturing, by a preset camera, an infrared image and a depth image of a user includes: when it is detected that the user is gazing at the display screen, acquiring the infrared image and the depth image through the preset camera. The electronic device only needs to detect that the user is gazing at the display screen to execute the gaze point estimation method of this solution, which extends the period during which the method can run. In this way, in a scenario where the electronic device assists the user according to the user's gaze point, the duration for which the electronic device accurately assists the user can be increased, improving the user experience.
In a second aspect, a gaze point estimation method is provided. The method is applied to an electronic device that includes a preset camera and a display screen, and the method includes: acquiring an infrared image and a depth image of a user through the preset camera; and running a human eye gaze information estimation model with the infrared image and the depth image as input to obtain the user's gaze information on the display screen. The human eye gaze information estimation model is used to: obtain a first gaze feature of the user's eye region by identifying the infrared image, obtain a second gaze feature of the user's eye region by identifying the depth image, and output the user's gaze information on the display screen, where the first gaze feature is used to indicate two-dimensional features of the user's eye key points and the second gaze feature is used to indicate three-dimensional features of the user's eye key points.
It can be understood that the model can have a powerful and rapid prediction function, and the human eye gazing information estimation model is utilized to obtain the gazing information of the user on the display screen, so that the estimation speed of the gazing information can be improved, and the user experience is improved. In a scene that the electronic equipment assists the user to work according to the gaze point of the user, the efficiency of the electronic equipment for assisting the user to work can be improved, and user experience is improved.
In a possible implementation manner of the second aspect, the human eye gaze information estimation model is further used to: obtain a first gaze feature of the user's eye key points by identifying the infrared image, obtain a second gaze feature of the user's eye key points by identifying the depth image, and output the user's gaze information on the display screen, where the first gaze feature is used to indicate gaze-related two-dimensional features of the user's eyes and the second gaze feature is used to indicate three-dimensional features of the user's eye key points.
In a possible implementation manner of the second aspect, a preset correspondence exists between the first eye pose parameter and the second eye pose parameter and gaze information of the user on the display screen. The above-described operational eye gaze information estimation model is further configured to: and determining the gazing information of the user on the display screen according to the preset corresponding relation by utilizing the first human eye pose parameter and the second human eye pose parameter.
In a possible implementation manner of the second aspect, the preset three-dimensional coordinate system is a three-dimensional coordinate system with an optical center of the preset camera as an origin.
In a third aspect, an electronic device is provided that includes a memory and one or more processors; the memory is used for storing code instructions; the processor is configured to execute the code instructions to cause the electronic device to perform the gaze point estimation method as in any one of the possible designs of the first aspect and the second aspect.
In a fourth aspect, a computer readable storage medium is provided, the computer readable storage medium comprising computer instructions which, when run on an electronic device, cause the electronic device to perform the gaze point estimation method as in any one of the possible designs of the first and second aspects.
In a fifth aspect, a computer program product is provided, comprising computer programs/instructions which, when executed by a processor, implement the gaze point estimation method in any one of the possible designs of the first and second aspects.
The technical effects of any one of the design manners of the second aspect, the third aspect, the fourth aspect and the fifth aspect may refer to the technical effects of the corresponding design manners of the first aspect, and are not repeated here.
Drawings
FIG. 1 shows a schematic diagram of an IR image and a depth image;
Fig. 2 shows a schematic diagram of a mobile phone 100;
FIG. 3 illustrates an application scenario diagram for acquiring a face image of a user;
FIG. 4 shows a flow diagram of a gaze point estimation method;
FIG. 5 shows a flow diagram of a gaze point estimation method;
FIG. 6 shows a schematic calculation of a model for estimating human eye gaze information;
fig. 7 shows an interface diagram of a mobile phone 100;
Fig. 8 shows an interface schematic of an electronic book application.
Detailed Description
Illustrative embodiments of the application include, but are not limited to, a gaze point estimation method, an electronic device, and a computer readable storage medium.
Embodiments of the present application will now be described with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the present application. As one of ordinary skill in the art can know, with the development of technology and the appearance of new scenes, the technical scheme provided by the embodiment of the application is also applicable to similar technical problems.
In order to solve the technical problems in the background art, an embodiment of the application provides a gaze point estimation method, which is applied to an electronic device having a preset camera and a display screen. The method includes the following steps: the electronic device captures an infrared (Infrared Radiation, IR) image and a depth image of the user through the preset camera; for example, fig. 1 shows a schematic diagram of an IR image and a depth image. The IR image is captured using modulated infrared light pulses emitted by an infrared transmitter: the light continuously strikes the surface of an object, is reflected, and is received by a receiver; the time difference is calculated from the phase change, and the depth information of the object is calculated in combination with the speed of light, so the imaging is generally not disturbed by ambient light. Therefore, even in a scene that is too dark or too bright, the IR image captured by the electronic device can contain a clear image of the user's face, from which an eye region image of the user (hereinafter, the first eye region image) and the position information of the user's eye key points in that eye region image can be extracted. Since the content of the user's IR image and depth image is substantially the same and the same content lies at substantially the same positions in both images, the three-dimensional features indicating the user's eye key points (i.e., the second gaze feature hereinafter) can be obtained from the depth image using the position information of the eye key points obtained above (i.e., the first position information hereinafter). Moreover, the electronic device can also extract two-dimensional feature information indicating the user's eyes (i.e., the first gaze feature hereinafter) from the eye region image of the IR image. In this way, the electronic device can determine the user's gaze information on the display screen based on these two gaze features. A specific scheme for determining the user's gaze information on the display screen based on the two gaze features is described below.
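As a compact statement of the depth calculation just described, the distance follows from the round-trip time of the modulated infrared light, which is in turn obtained from the measured phase change; the modulation frequency f_mod is a property of the sensor and is only named here for illustration:

```latex
d \;=\; \frac{c\,\Delta t}{2} \;=\; \frac{c\,\Delta\varphi}{4\pi f_{\mathrm{mod}}}
```

where c is the speed of light, Δt the round-trip time of the infrared pulse, and Δφ the phase change measured at the receiver.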
In this solution, even if the electronic device photographs the user in a scene that is too dark or too bright, it can acquire the two-dimensional and three-dimensional features of the user's eye key points and accurately determine the user's gaze information on the display screen. In a scenario where the electronic device assists the user according to the user's gaze point, the electronic device can successfully assist the user, and the reliability of this assistance is improved.
By way of example, the electronic device in the embodiments of the present application may be a mobile phone, a tablet, a personal computer, a smart screen, a wearable headset (e.g., a virtual reality headset or an augmented reality headset), etc., where the wearable headset may be augmented reality (Augmented Reality, AR) glasses or virtual reality (Virtual Reality, VR) glasses.
The embodiment of the application is illustrated by taking the electronic equipment as a mobile phone. Fig. 2 shows a schematic diagram of a mobile phone 100.
As shown in fig. 2, the mobile phone 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (universal serial bus, USB) interface 130, a charge management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, keys 190, a motor 191, an indicator 192, a camera 193, a display (touch screen) 194, a subscriber identity module (subscriber identification module, SIM) card interface 195, and the like.
It should be understood that the structure illustrated in this embodiment is not limited to the specific configuration of the mobile phone 100. In other embodiments, the handset 100 may include more or fewer components than shown, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units, such as: the processor 110 may include an application processor (application processor, AP), a modem processor, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), a controller, a memory, a video codec, a digital signal processor (digital signal processor, DSP), a baseband processor, and/or a neural network processor (NPU), etc. The different processing units may be separate devices or may be integrated in one or more processors. In an embodiment of the present application, the processor 110 is configured to obtain an IR image and a depth image of the user, and determine the user's gaze information on the display (touch screen) 194 of the mobile phone 100 based on the two images. Specifically, the NPU may acquire the IR image and the depth image of the user and determine the user's gaze information on the display (touch screen) 194 of the mobile phone 100 based on the two images.
The controller may be a neural hub and command center of the cell phone 100. The controller can generate operation control signals according to the instruction operation codes and the time sequence signals to finish the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that the processor 110 has just used or recycled. If the processor 110 needs to reuse the instruction or data, it can be called directly from the memory. Repeated accesses are avoided and the latency of the processor 110 is reduced, thereby improving the efficiency of the system.
In some embodiments, the processor 110 may include one or more interfaces. The interfaces may include an inter-integrated circuit (inter-integrated circuit, I2C) interface, an inter-integrated circuit sound (inter-integrated circuit sound, I2S) interface, a pulse code modulation (pulse code modulation, PCM) interface, a universal asynchronous receiver/transmitter (universal asynchronous receiver/transmitter, UART) interface, a mobile industry processor interface (mobile industry processor interface, MIPI), a general-purpose input/output (general-purpose input/output, GPIO) interface, a subscriber identity module (subscriber identity module, SIM) interface, and/or a universal serial bus (universal serial bus, USB) interface, among others.
It should be understood that the connection relationship between the modules illustrated in this embodiment is only illustrative, and is not limited to the structure of the mobile phone 100. In other embodiments, the mobile phone 100 may also use different interfacing manners, or a combination of multiple interfacing manners in the above embodiments.
The charge management module 140 is configured to receive a charge input from a charger. The charger can be a wireless charger or a wired charger. The charging management module 140 may also supply power to the electronic device through the power management module 141 while charging the battery 142.
The power management module 141 is used for connecting the battery 142, and the charge management module 140 and the processor 110. The power management module 141 receives input from the battery 142 and/or the charge management module 140 and provides power to the processor 110, the internal memory 121, the external memory, the display 194, the camera 193, the wireless communication module 160, and the like. In some embodiments, the power management module 141 and the charge management module 140 may also be provided in the same device.
The wireless communication function of the mobile phone 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem processor, a baseband processor, and the like. In some embodiments, the antenna 1 and the mobile communication module 150 of the handset 100 are coupled, and the antenna 2 and the wireless communication module 160 are coupled, so that the handset 100 can communicate with a network and other devices through wireless communication technology.
The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in the handset 100 may be used to cover a single or multiple communication bands. Different antennas may also be multiplexed to improve the utilization of the antennas. For example, the antenna 1 may be multiplexed into a diversity antenna of a wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
The mobile communication module 150 may provide a solution for wireless communication including 2G/3G/4G/5G, etc. applied to the handset 100. The mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (low noise amplifier, LNA), etc. The mobile communication module 150 may receive electromagnetic waves from the antenna 1, perform processes such as filtering, amplifying, and the like on the received electromagnetic waves, and transmit the processed electromagnetic waves to the modem processor for demodulation.
The mobile communication module 150 can amplify the signal modulated by the modem processor, and convert the signal into electromagnetic waves through the antenna 1 to radiate. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be disposed in the processor 110. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be provided in the same device as at least some of the modules of the processor 110.
The wireless communication module 160 may provide solutions for wireless communication applied to the mobile phone 100, including wireless local area network (WLAN) (e.g., a wireless fidelity (wireless fidelity, Wi-Fi) network), Bluetooth (BT), global navigation satellite system (global navigation satellite system, GNSS), frequency modulation (frequency modulation, FM), near field communication (near field communication, NFC), infrared (IR), etc.
The wireless communication module 160 may be one or more devices that integrate at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via the antenna 2, modulates the electromagnetic wave signals, filters the electromagnetic wave signals, and transmits the processed signals to the processor 110. The wireless communication module 160 may also receive a signal to be transmitted from the processor 110, frequency modulate it, amplify it, and convert it to electromagnetic waves for radiation via the antenna 2.
The mobile phone 100 may implement photographing functions through an ISP, a camera 193, a video codec, a GPU, a display 194, an application processor, and the like. The ISP is used to process data fed back by the camera 193. The camera 193 is used to capture still images or video. In some embodiments, the cell phone 100 may include 1 or N cameras 193, where N is a positive integer greater than or equal to 1. In the embodiment of the present application, after the mobile phone 100 starts the camera 193, real-time image data can be acquired.
The camera 193 may be a time-of-flight (Time of Flight, TOF) camera. TOF is a depth information measurement scheme mainly composed of an infrared light projector and a receiving module. The projector projects infrared light outwards; the infrared light is reflected after encountering the measured object and is received by the receiving module. The depth information of the measured object is calculated by recording the time from the emission of the infrared light to its reception, and 3D modeling is thereby completed. The TOF camera can acquire IR images and depth images, and can project IR light onto a face to acquire an IR image of the user's face. Fig. 3 shows a schematic view of an application scenario for acquiring a face image of the user. For example, as shown in fig. 3, the handset 100 may activate an infrared camera to capture an IR image of a person's face. In addition, the IR image and depth image may also be acquired by a 3D structured light camera, but this is not limiting.
The mobile phone 100 implements display functions through a GPU, a display 194, an application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The display screen 194 is used to display images, videos, and the like. The display 194 includes a display panel. In the embodiment of the present application, the gaze information may include gaze point information or gaze area information of the user on the display screen 194 of the mobile phone 100.
The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to extend the memory capabilities of the handset 100. The external memory card communicates with the processor 110 through an external memory interface 120 to implement data storage functions. For example, files such as music, video, etc. are stored in an external memory card.
The internal memory 121 may be used to store computer executable program code including instructions. The processor 110 executes various functional applications of the cellular phone 100 and data processing by executing instructions stored in the internal memory 121. For example, in an embodiment of the present application, the processor 110 may include a storage program area and a storage data area by executing instructions stored in the internal memory 121.
The storage program area may store an application program (such as a sound playing function, a service preemption function, etc.) required for at least one function of the operating system, etc. The storage data area may store data (e.g., audio data, phonebook, etc.) created during use of the handset 100, etc. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a nonvolatile memory such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (universal flash storage, UFS), and the like.
The handset 100 may implement audio functions through an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, an application processor, and the like. Such as music playing, recording, etc.
The keys 190 include a power-on key, a volume key, etc. The keys 190 may be mechanical keys. Or may be a touch key. The motor 191 may generate a vibration cue. The motor 191 may be used for incoming call vibration alerting as well as for touch vibration feedback. The indicator 192 may be an indicator light, may be used to indicate a state of charge, a change in charge, a message indicating a missed call, a notification, etc. The SIM card interface 195 is used to connect a SIM card. The SIM card may be inserted into the SIM card interface 195 or removed from the SIM card interface 195 to enable contact and separation with the handset 100. The mobile phone 100 may support 1 or N SIM card interfaces, where N is a positive integer greater than or equal to 1. The SIM card interface 195 may support Nano SIM cards, micro SIM cards, and the like.
The embodiment of the application provides a gaze point estimation method, which can be applied to electronic equipment (such as a mobile phone 100) with the above hardware structure. Fig. 4 shows a flow diagram of a gaze point estimation method. As shown in fig. 4, the gaze point estimation method provided by the embodiment of the present application may include the following steps:
401: the cell phone 100 captures IR images and depth images of the user through the camera 193.
The imaging of the IR image is not disturbed by ambient light. Therefore, even in a scene that is too dark or too bright, the IR image photographed by the mobile phone 100 can contain a clear image of the user's face, from which an eye region image of the user (hereinafter, the first eye region image) and the position information of the user's eye key points in that eye region image can be extracted. Since the content of the user's IR image and depth image is substantially the same and the same content lies at substantially the same positions in both images, the three-dimensional features indicating the user's eye key points (i.e., the second gaze feature hereinafter) can be obtained from the depth image using the position information of the eye key points obtained above. Moreover, the mobile phone 100 can also extract a two-dimensional, gaze-related feature of the user's eyes (hereinafter, the first gaze feature) from the eye region image of the IR image. In this way, the mobile phone 100 can determine the user's gaze information on the display 194 based on these two gaze features.
In some embodiments, the triggering conditions required for capturing IR and depth images of a user by the cell phone 100 via the camera 193 are described below by way of example as follows:
Mode 1:
some applications may be set in the mobile phone 100, and the application may support the gaze point estimation method of the present application, so that after the application is started, the mobile phone 100 may periodically collect the infrared image and the depth image of the user through the preset camera 193. Specifically, the mobile phone 100 receives a first operation, where the first operation is used to trigger the mobile phone 100 to start a preset application; in response to the first operation, after the preset application is started, the cellular phone 100 periodically acquires an infrared image and a depth image of the user through the preset camera 193. The preset application may be an electronic book application, a browser application, a news application, etc., but is not limited thereto.
Mode 2:
The cooperative working mode may refer to a state in which the mobile phone 100 can determine the user's gaze point, determine the user's intention according to the gaze point, and perform a corresponding operation based on that intention. For example, in a scenario where a user reads an electronic book, the mobile phone may determine the user's gaze point, determine whether the user has an intention to turn the page according to the gaze point, and if so, actively perform a page turning operation for the user. As another example, in a web browsing scenario, the mobile phone 100 may determine the user's gaze point, determine the content the user is interested in according to the gaze point, and actively recommend that content to the user. If the mobile phone 100 is in the cooperative working mode, it can also periodically collect the infrared image and depth image of the user through the preset camera 193, so that it can more accurately detect the user's gaze information on the display 194. Specifically, if the mobile phone 100 is in the preset cooperative working mode, it periodically acquires the infrared image and depth image of the user through the camera 193.
In some embodiments, the mobile phone 100 may enter the preset cooperative working mode in response to the user turning on a preset cooperative switch. The preset cooperative switch may be configured in a setting interface of the mobile phone 100, in a control center of the mobile phone 100, or in a preset application of the mobile phone 100.
Mode 3:
Some applications in the mobile phone 100 may require that the cooperative working mode be enabled; the mobile phone 100 periodically collects the infrared image and depth image of the user through the preset camera 193 only when both conditions hold, i.e., such an application has been started and the cooperative working mode is on.
The preset cooperative switch may be configured in a setting interface of the mobile phone 100, in a control center of the mobile phone 100, or in a preset application of the mobile phone 100, and the mobile phone 100 may enter the preset cooperative working mode in response to the user turning on any one of these preset cooperative switches.
It should be noted that, before or after the application is started, the mobile phone 100 may start the cooperative mode.
Mode 4:
In some embodiments, the mobile phone 100 may support the gaze point estimation method of the present application in a power-on state. Then, the mobile phone 100 may execute the gaze point estimation method of the present embodiment as long as detecting that the user gazes at the display 194, so as to increase the duration of executing the gaze point estimation method of the present embodiment by the mobile phone 100. In this way, in the scenario where the mobile phone 100 assists the user to work according to the gaze point of the user, the time period during which the mobile phone 100 accurately assists the user to work can be increased, so that the user experience is improved. Specifically, the mobile phone 100 acquires an infrared image and a depth image through the preset camera 193 when it detects that the user looks at the display 194.
402: The handset 100 determines the user's gaze information on the display 194 based on the user's IR image and depth image.
The gaze information may include gaze point information, gaze area information, or both, for the user on the display 194 of the handset 100, where the gaze point information may be the coordinates of a gaze point and the gaze area information may be the identifier of a gaze area. For example, as shown in fig. 5, the display 194 of the mobile phone 100 may be divided into a plurality of gaze areas, gaze areas 1 to 8, whose identifiers may be the numbers 1 to 8. The gaze point information may be the gaze point coordinates (x1, y2) of the user on the display 194 of the handset 100.
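A minimal sketch of how gaze point coordinates could be mapped to one of the eight gaze-area identifiers is shown below; the 2x4 grid layout and the screen resolution are assumptions made for illustration, since the patent does not specify how the areas in fig. 5 are laid out.

```python
def gaze_region_id(x, y, screen_w=1080, screen_h=2340, cols=2, rows=4):
    """Map a gaze point (x, y) in display pixels to a region identifier 1..cols*rows.
    Assumes the display is split into a simple cols x rows grid."""
    col = min(int(x / (screen_w / cols)), cols - 1)
    row = min(int(y / (screen_h / rows)), rows - 1)
    return row * cols + col + 1   # identifiers numbered 1..8

print(gaze_region_id(600, 300))   # e.g. a gaze point at (600, 300) -> region 2
```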
The cell phone 100 may determine the user's gaze information on the display 194 based on the IR image and depth image of the user; a specific implementation is provided below. For example, fig. 5 shows a flow diagram of a gaze point estimation method. As shown in fig. 5, the method includes the following steps 501 to 504:
501: the mobile phone 100 identifies the IR image to obtain a first eye region image of the user and first position information of the user's eye key points in the first eye region image; the first position information may be the coordinates of each pixel of the user's eye key points in the IR image, where the coordinates may be x-axis and y-axis coordinates in a preset two-dimensional coordinate system.
It is understood that the eye key points may include the human eye and the components surrounding it, such as the eyebrows, eyelids, eyeballs, eyelashes, corners of the eyes, and pupils.
It will be appreciated that the mobile phone 100 may identify the IR image via a target detection algorithm to obtain the first eye region image of the user and the user's eye key points in the first eye region image. For example, as shown in fig. 1, the handset 100 obtains a first eye region image E of the user from the IR image.
The preset two-dimensional coordinate system may be a two-dimensional coordinate system preset by the mobile phone 100. In this way, after obtaining the eye key points of the user in the first eye area image, the mobile phone 100 can obtain the coordinates of each pixel point of the eye key points of the user in the first eye area image based on the two-dimensional coordinate system, and the coordinates of the pixel points form the first position information.
502: The handset 100 obtains a first gaze feature based on the first eye region image, wherein the first gaze feature is used to indicate a two-dimensional feature of the eyes of the user that is related to gaze.
The eye region image contains much feature information about the gaze-related characteristics of the user's eyes; therefore, the mobile phone 100 can obtain features indicating those characteristics based on the eye region image. The gaze-related characteristics of the user's eyes may include information about the human eye and the components surrounding it, such as the eyebrows, eyelids, eyeballs, eyelashes, corners of the eyes, and pupils. In particular, the first gaze feature may include morphological and state information of the human eye. The morphological information may be the shape and size of the human eye, such as pupil shape and size, eyelid type, and eye contour size. The state information may be dynamic information of the human eye, for example the eyeball rotation state, eyeball pose, opening and closing degree of the eye corners, and pupil direction.
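To make the composition of the first gaze feature concrete, a minimal sketch of the kind of morphological and state information it might bundle is shown below; the field names, units and example values are illustrative assumptions, not a structure defined by the patent.

```python
from dataclasses import dataclass

@dataclass
class FirstGazeFeature:
    # Morphological information: shape and size of the human eye.
    pupil_diameter_px: float        # pupil size in pixels
    eye_contour_wh: tuple           # (width, height) of the eye contour
    eyelid_type: str                # e.g. "single" / "double"
    # State information: dynamic information of the human eye.
    eyeball_rotation_deg: float     # eyeball rotation state
    eye_corner_openness: float      # opening/closing degree of the eye corner
    pupil_direction_deg: float      # pupil direction

feature = FirstGazeFeature(6.5, (42.0, 18.0), "double", 12.0, 0.8, -4.0)
print(feature)
```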
503: The handset 100 derives a second gaze feature from the depth image based on the first location information, wherein the second gaze feature is used to indicate a three-dimensional feature of an eye key point of the user.
Since the content of the user's IR image and depth image is substantially the same and the same content lies at substantially the same positions in both images, the mobile phone 100 can use the first position information obtained above to locate the corresponding positions in the depth image (which lies in the same preset two-dimensional coordinate system) and obtain the second gaze feature indicating the three-dimensional features of the user's eye key points.
The second gaze feature comprises 3D information of the user's eyes, the 3D information of the user's eyes comprising three-dimensional coordinate information (x, y, z), wherein x represents an x-axis position of the user's eyes in a preset three-dimensional coordinate system, y represents a y-axis position of the user's eyes in the preset three-dimensional coordinate system, and z represents a distance between the user and an optical center of the camera 193 of the mobile phone 100. The preset three-dimensional coordinate system may be a three-dimensional coordinate system with the optical center of the camera 193 as the origin.
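A minimal sketch of how the three-dimensional coordinates (x, y, z) of an eye key point could be obtained from its pixel position in the depth image is shown below, using a pinhole camera model; the intrinsic parameters (fx, fy, cx, cy) are assumed example values, and the pinhole back-projection is one possible realization rather than the mapping the patent actually uses.

```python
import numpy as np

def keypoints_to_3d(keypoints_uv, depth_image, fx=600.0, fy=600.0, cx=320.0, cy=240.0):
    """Back-project eye key points (u, v) plus their depth into the preset
    three-dimensional coordinate system with the camera optical center as origin."""
    points_3d = []
    for u, v in keypoints_uv:
        z = float(depth_image[v, u])        # distance along the optical axis
        x = (u - cx) * z / fx               # pinhole-model back-projection
        y = (v - cy) * z / fy
        points_3d.append((x, y, z))
    return np.array(points_3d)

depth = np.full((480, 640), 0.35, dtype=np.float32)   # synthetic 35 cm depth map
print(keypoints_to_3d([(300, 220), (360, 220)], depth))
```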
In some embodiments, the handset 100 may derive the second gaze feature from the depth image, which is also located in the preset two-dimensional coordinate system, based on the first location information.
In some embodiments, the preset three-dimensional coordinate system has a mapping relationship with the preset two-dimensional coordinate system. The handset 100 may derive the second gaze feature in the depth image from this mapping based on the first position information.
504: the mobile phone 100 determines the user's gaze information on the display 194 based on the first gaze feature and the second gaze feature.
The first gaze feature is a two-dimensional feature indicating gaze-related characteristics of the user's eyes, and the second gaze feature is a three-dimensional feature indicating the user's eye key points. From these features, the orientation of the user's eyes can be predicted, so the mobile phone 100 can determine the orientation of the user's eyes according to the first gaze feature and the second gaze feature, and then determine the user's gaze information on the display screen 194 based on that orientation.
Specifically, the method comprises the following steps 5041-5042:
5041: the mobile phone 100 obtains the first eye pose parameter according to the first gaze feature, and obtains the second eye pose parameter according to the second gaze feature. The eye pose parameters include line-of-sight orientation information of the user's eyes, which includes a rotation angle, a pitch angle and a heading angle of the user's eyes in a preset three-dimensional coordinate system; the preset three-dimensional coordinate system may be a three-dimensional coordinate system with the optical center of the camera 193 as the origin.
5042: The cell phone 100 determines the user's gaze information on the display screen based on the first and second eye pose parameters.
In some embodiments, a preset correspondence exists between the human eye pose parameter and the user's gaze information on the display screen 194, and the mobile phone 100 may determine the user's gaze information on the display screen 194 from the preset correspondence using the first human eye pose parameter and the second human eye pose parameter.
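A minimal sketch of one way such a preset correspondence could be realized is shown below, as a calibration table that maps eye pose parameters to gaze regions and is queried by nearest neighbour; the table contents, the concatenation of the two pose parameters, and the distance metric are illustrative assumptions.

```python
import numpy as np

# Hypothetical calibration table: each row concatenates (roll, pitch, yaw) from the
# first and second eye pose parameters, paired with the gaze region it maps to.
calibration_poses = np.array([
    [0.0,  -5.0, -8.0,  0.0,  -5.0, -8.0],   # looking toward region 1
    [0.0,  -5.0,  8.0,  0.0,  -5.0,  8.0],   # looking toward region 2
    [0.0, -20.0, -8.0,  0.0, -20.0, -8.0],   # looking toward region 3
    [0.0, -20.0,  8.0,  0.0, -20.0,  8.0],   # looking toward region 4
])
calibration_regions = [1, 2, 3, 4]

def lookup_gaze_region(first_pose, second_pose):
    """Pick the calibrated entry closest to the measured pose parameters."""
    query = np.concatenate([first_pose, second_pose])
    idx = np.argmin(np.linalg.norm(calibration_poses - query, axis=1))
    return calibration_regions[idx]

print(lookup_gaze_region([0.0, -6.0, 7.5], [0.0, -4.5, 8.2]))   # -> region 2
```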
It can be understood that such a model can make powerful and rapid predictions. Using the human eye gaze information estimation model to obtain the user's gaze information on the display screen improves the estimation speed of the gaze information and thus the user experience; in a scenario where the electronic device assists the user according to the user's gaze point, it also improves the efficiency with which the electronic device assists the user. Thus, in some embodiments, the functionality of step 402 may be integrated into a human eye gaze information estimation model. In this way, the mobile phone 100 can input the IR image and depth image of the user into the trained model and use it to determine the user's gaze information on the display 194 of the mobile phone 100.
For example, fig. 6 shows a schematic calculation of a model for estimating human eye gaze information. As shown in fig. 6, the human eye gaze information estimation model includes an eye region image determination module 1, a first gaze feature determination module 2, a second gaze feature determination module 3, and a gaze information output module 4.
The eye area image determination module 1 has a function of executing the content of step 501 described above.
The first gaze feature determination module 2 has the functionality to perform the content of step 502 described above.
The second gaze feature determination module 3 has the functionality to perform the content of step 503 described above.
The gaze information output module 4 has a function of executing the content of step 504 described above. The gaze information output module 4 may be a model with classification functions and may include a fully connected layer.
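The following PyTorch-style sketch shows one possible way to organize the four modules just listed; it is a sketch under assumptions, not the patent's architecture. The layer sizes, the eye-crop resolution, the number of key points, and the choice to sample depth at the key points outside the model (so module 3 consumes precomputed 3D key points) are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class EyeGazeEstimationModel(nn.Module):
    """Hypothetical layout mirroring modules 1-4 described above."""
    def __init__(self, num_keypoints=8, num_regions=8):
        super().__init__()
        # Module 1: eye region / key point determination from the IR image.
        self.eye_region_net = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 4 * 4, num_keypoints * 2),   # (u, v) per key point
        )
        # Module 2: first gaze feature (2D, gaze-related) from the eye crop.
        self.first_feature_net = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(16 * 4 * 4, 64), nn.ReLU(),
        )
        # Module 3: second gaze feature (3D) from depth values at the key points.
        self.second_feature_net = nn.Sequential(
            nn.Linear(num_keypoints * 3, 64), nn.ReLU(),
        )
        # Module 4: gaze information output (fully connected heads: classification
        # for the gaze region, regression for the gaze point coordinates).
        self.region_head = nn.Linear(128, num_regions)
        self.point_head = nn.Linear(128, 2)

    def forward(self, ir_image, eye_crop, keypoints_3d):
        # In a real system the predicted key points would drive the depth sampling;
        # here they are computed but the cropping/sampling is assumed pre-done.
        _ = self.eye_region_net(ir_image)                       # module 1
        f1 = self.first_feature_net(eye_crop)                   # module 2
        f2 = self.second_feature_net(keypoints_3d.flatten(1))   # module 3
        fused = torch.cat([f1, f2], dim=1)
        return self.region_head(fused), self.point_head(fused)  # module 4

model = EyeGazeEstimationModel()
region_logits, gaze_point = model(torch.rand(1, 1, 240, 320),   # IR image
                                  torch.rand(1, 1, 56, 112),    # eye region crop
                                  torch.rand(1, 8, 3))          # 3D key points
print(region_logits.shape, gaze_point.shape)
```

Used this way, the sketch mirrors the second aspect above: the IR-derived and depth-derived inputs go in, and both a gaze region and a gaze point come out.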
It will be appreciated that the following steps 1 and 2 may be performed during the preliminary setup phase of the eye gaze information estimation model to obtain a plurality of training samples for training the model. Step 1: acquire IR images and depth maps of the user. Step 2: acquire the real gaze information of the user on the display screen 194 corresponding to each IR image and depth map.
It can be understood that these training samples are used to train the human eye gaze information estimation model in the stage of initially establishing it. After training on many samples, the model can output the user's gaze information on the display screen 194 corresponding to a given IR image and depth map of the user, and the more samples it is trained on, the higher the accuracy of the gaze information it produces. Thus, in the embodiment of the present application, the pre-configured eye gaze information estimation model in the mobile phone 100 may be an AI model trained on a large number of samples.
The AI model may be, for example, a convolutional neural network, such as a model structure based on convolutional layers, pooling layers, and fully connected layers; the embodiment of the present application does not limit this.
The eye gaze information estimation model may output gaze information including gaze point information and gaze area information. The model outputs the gaze region by a classification method and improves gaze point estimation accuracy by combining multitask (classification and regression) learning. Compared with existing solutions in the industry, the method has higher algorithm robustness and richer usage scenarios. In summary, even if the mobile phone 100 shoots the user in an environment that is too dark or too bright, the mobile phone 100 can still acquire the geometric features of the user's eyes and, by further using the depth image, accurately determine the gaze point of the user, so that the mobile phone 100 can successfully assist the user in working, improving the reliability of that assistance.
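One way to realize the multitask (classification plus regression) learning mentioned above is sketched below; the 0.5 weighting between the gaze-region classification term and the gaze-point regression term is an assumed value, not one taken from this disclosure. An instance of this loss could serve as the criterion in the training sketch above.

```python
# Sketch of a multitask loss combining gaze-region classification with gaze-point
# regression; the 0.5 weighting is an assumed value, not one taken from this text.
import torch.nn as nn

class MultiTaskGazeLoss(nn.Module):
    def __init__(self, region_weight: float = 0.5):
        super().__init__()
        self.point_loss = nn.MSELoss()            # regression on the gaze point (x, y)
        self.region_loss = nn.CrossEntropyLoss()  # classification of the gaze region
        self.region_weight = region_weight

    def forward(self, prediction, target):
        pred_point, region_logits = prediction
        gt_point, gt_region = target
        return (self.point_loss(pred_point, gt_point)
                + self.region_weight * self.region_loss(region_logits, gt_region))
```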
In some cases, the user's head may not remain stationary; the user may turn, raise, or lower the head. As a result, the IR image and depth image of the user captured by the mobile phone 100 may contain no image of the user's eyes. If they do not, the mobile phone 100 would still consume energy and time performing this step. To save that energy and time, the mobile phone 100 may first determine whether an eye image of the user exists in the IR image: if so, it performs this step; if not, it re-performs step 401, that is, it collects the IR image and the depth image of the user again through the camera 193.
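The gating described in this paragraph could be expressed as the following sketch; detect_eyes() stands in for any eye detector run on the IR image and is a hypothetical placeholder, as is the camera interface.

```python
# Sketch of the gating step: run the gaze model only when an eye is visible in the
# IR frame; otherwise the caller re-captures on the next cycle (step 401). The
# camera interface and detect_eyes() are hypothetical placeholders.
def detect_eyes(ir_frame) -> bool:
    """Placeholder eye detector; a real implementation might run eye key-point
    detection on the IR frame. Always True here only to keep the sketch runnable."""
    return True

def process_frame(camera, model):
    ir, depth = camera.capture_ir_and_depth()   # assumed camera API
    if not detect_eyes(ir):                     # no eye image -> skip the heavy step
        return None                             # caller repeats the capture (step 401)
    return model(ir, depth)
```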
The mobile phone 100 may use the gaze information of the user on the display 194 to work cooperatively with the user. For example, fig. 7 shows an interface diagram of the mobile phone 100. As shown in fig. 7, the mobile phone 100 has an electronic book application; after clicking the electronic book application icon A, the user reads the electronic book. While the user reads, the mobile phone 100 may determine the gaze point of the user and judge from it whether the user intends to turn the page. For example, fig. 8 illustrates an interface schematic of the electronic book application. As shown in fig. 8, the mobile phone 100 may determine whether the user's gaze point is on the previous-page button B1 or the next-page button B2; if it is on the next-page button B2, the mobile phone actively turns to the next page for the user. In this way, the intelligent experience of the mobile phone 100 working in cooperation with the user can be improved.
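The page-turn decision in figs. 7 and 8 amounts to hit-testing the estimated gaze point against the button regions, as in the following sketch; the button rectangles and coordinate values are assumed for illustration only.

```python
# Sketch of the page-turn decision: hit-test the estimated gaze point against the
# previous/next page buttons B1 and B2. The rectangles are assumed coordinates.
from typing import Optional, Tuple

Rect = Tuple[int, int, int, int]               # left, top, right, bottom (pixels)
PREV_BUTTON_B1: Rect = (40, 2200, 240, 2300)   # assumed position of button B1
NEXT_BUTTON_B2: Rect = (840, 2200, 1040, 2300) # assumed position of button B2

def contains(rect: Rect, point: Tuple[float, float]) -> bool:
    left, top, right, bottom = rect
    x, y = point
    return left <= x <= right and top <= y <= bottom

def page_turn_intent(gaze_point: Tuple[float, float]) -> Optional[str]:
    if contains(NEXT_BUTTON_B2, gaze_point):
        return "next_page"       # the phone actively turns to the next page
    if contains(PREV_BUTTON_B1, gaze_point):
        return "previous_page"
    return None                  # no page-turn intention detected
```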
The gaze point estimation method provided by the embodiment of the application can be applied to human-computer interaction on smartphones, tablets, smart screens, and AR/VR glasses.
Existing gaze point estimation methods estimate the gaze point only from RGB images and do not consider the influence, on the gaze point, of images captured at different depths and with different poses, which reduces gaze point estimation accuracy. Based on IR images, the present application can also perform gaze point estimation in dim-light scenes.
Existing human eye gaze information estimation models take an extracted face image and a face grid as input. If the user wears a mask, face extraction may fail and gaze point estimation becomes impossible. The present application relies only on eye features, and that feature extraction is not affected by wearing a mask, so it can handle scenarios in which the user wears a mask.
Another embodiment of the present application provides an electronic device including a memory and one or more processors, where the memory is coupled to the processor and stores computer program code comprising computer instructions. When the computer instructions are executed by the processor, the electronic device can perform the functions or steps performed by the mobile phone 100 in the method embodiments described above. The structure of the electronic device may refer to the structure of the mobile phone 100 shown in fig. 2.
Embodiments of the present application also provide a computer readable storage medium that includes computer instructions which, when run on an electronic device, cause the electronic device to perform the functions or steps performed by the mobile phone 100 in the foregoing method embodiments.
Embodiments of the present application also provide a computer program product which, when run on a computer, causes the computer to perform the functions or steps performed by the mobile phone 100 in the method embodiments described above. The computer may be the electronic device (e.g., the mobile phone 100) described above.
Embodiments of the disclosed mechanisms may be implemented in hardware, software, firmware, or a combination of these implementations. Embodiments of the application may be implemented as a computer program or program code that is executed on a programmable system comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
Program code may be applied to input instructions to perform the functions described herein and generate output information. The output information may be applied to one or more output devices in a known manner. For purposes of the present application, a processing system includes any system having a processor such as, for example, a digital signal processor (Digital Signal Processor, DSP), a microcontroller, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), or a microprocessor.
The program code may be implemented in a high level procedural or object oriented programming language to communicate with a processing system. Program code may also be implemented in assembly or machine language, if desired. Indeed, the mechanisms described in the present application are not limited in scope by any particular programming language. In either case, the language may be a compiled or interpreted language.
In some cases, the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof. The disclosed embodiments may also be implemented as instructions carried by or stored on one or more transitory or non-transitory machine-readable (e.g., computer-readable) storage media, which may be read and executed by one or more processors. For example, the instructions may be distributed over a network or through other computer-readable storage media. Thus, a machine-readable storage medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), including, but not limited to, floppy diskettes, optical disks, compact disc read-only memories (CD-ROMs), magneto-optical disks, read-only memories (Read-Only Memory, ROM), random access memories (Random Access Memory, RAM), erasable programmable read-only memories (Erasable Programmable Read-Only Memory, EPROM), electrically erasable programmable read-only memories (Electrically Erasable Programmable Read-Only Memory, EEPROM), magnetic or optical cards, flash memory, or tangible machine-readable memory used to transmit information over the Internet via electrical, optical, acoustical or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.). Thus, a machine-readable storage medium includes any type of machine-readable medium suitable for storing or propagating electronic instructions or information in a form readable by a machine (e.g., a computer).
In the drawings, some structural or methodological features may be shown in a particular arrangement and/or order. However, it should be understood that such a particular arrangement and/or ordering may not be required. Rather, in some embodiments, these features may be arranged in a different manner and/or order than shown in the illustrative figures. Additionally, the inclusion of structural or methodological features in a particular figure is not meant to imply that such features are required in all embodiments, and in some embodiments, may not be included or may be combined with other features.
It should be noted that, in the embodiments of the present application, each unit/module mentioned in each device is a logical unit/module. Physically, one logical unit/module may be one physical unit/module, may be a part of one physical unit/module, or may be implemented by a combination of multiple physical units/modules; the physical implementation of the logical unit/module itself is not what matters most, and the combination of functions implemented by these logical units/modules is the key to solving the technical problem posed by the present application. Furthermore, in order to highlight the innovative part of the present application, the above device embodiments do not introduce units/modules that are less closely related to solving the technical problem posed by the present application, which does not mean that the above device embodiments contain no other units/modules.
It should be noted that in the examples and descriptions of this patent, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
While the application has been shown and described with reference to certain preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the application.

Claims (11)

1. A gaze point estimation method, characterized in that the method is applied to electronic equipment, the electronic equipment comprises a preset camera and a display screen, and the method comprises the following steps:
acquiring an infrared image and a depth image of a user through the preset camera;
identifying the infrared image to obtain a first eye region image of the user and first position information of eye key points of the user in the first eye region image;
obtaining a first gaze feature based on the first eye region image, wherein the first gaze feature is used for indicating gaze-related two-dimensional features of the eyes of the user;
obtaining a second gaze feature from the depth image based on the first position information, wherein the second gaze feature is used for indicating three-dimensional features of the eye key points of the user;
acquiring a first human eye pose parameter according to the first gaze feature;
acquiring a second human eye pose parameter according to the second gaze feature; the first human eye pose parameter and the second human eye pose parameter comprise sight line orientation information of the eyes of the user, and the sight line orientation information of the eyes of the user comprises a rotation angle, a pitch angle and a course angle of the eyes of the user under a preset three-dimensional coordinate system; the preset three-dimensional coordinate system takes the optical center of the preset camera as an origin;
and determining the gaze information of the user on the display screen based on the first human eye pose parameter and the second human eye pose parameter.
2. The method of claim 1, wherein there is a preset correspondence between the first and second human eye pose parameters and gaze information of the user on the display screen;
the determining, based on the first human eye pose parameter and the second human eye pose parameter, gaze information of the user on the display screen includes:
determining the gaze information of the user on the display screen according to the preset correspondence by using the first human eye pose parameter and the second human eye pose parameter.
3. The method according to claim 1 or 2, wherein before the obtaining a first gaze feature based on the first eye region image, the method further comprises:
and determining that the infrared image comprises a human eye image.
4. A method according to any one of claims 1-3, wherein the capturing of the infrared image and the depth image of the user by the preset camera comprises:
receiving a first operation, wherein the first operation is used for triggering the electronic equipment to start a preset application;
and in response to the first operation, periodically acquiring the infrared image and the depth image of the user through the preset camera after the preset application is started.
5. A method according to any one of claims 1-3, wherein the capturing of the infrared image and the depth image of the user by the preset camera comprises:
if the electronic equipment is in a preset cooperative working mode, periodically acquiring the infrared image and the depth image of the user through the preset camera.
6. The method of claim 5, wherein prior to the capturing the infrared image and the depth image of the user by the preset camera, the method further comprises:
responding to the starting operation of a user on a preset cooperative switch, and entering a preset cooperative working mode;
the preset cooperative switch is configured in a setting interface of the electronic equipment, or is configured in a control center of the electronic equipment, or is configured in a preset application of the electronic equipment.
7. The method according to any one of claims 1-6, wherein the capturing, by the preset camera, an infrared image and a depth image of a user comprises:
when it is detected that the user is gazing at the display screen, acquiring the infrared image and the depth image through the preset camera.
8. The method according to any one of claims 1-7, wherein the preset camera is a TOF camera or a 3D structured light camera.
9. A gaze point estimation method, characterized in that the method is applied to electronic equipment, the electronic equipment comprises a preset camera and a display screen, and the method comprises the following steps:
acquiring an infrared image and a depth image of a user through the preset camera;
taking the infrared image and the depth image as input, and running a human eye gaze information estimation model to obtain the gaze information of the user on the display screen;
wherein the human eye gaze information estimation model is used for:
identifying the infrared image to obtain a first eye region image of the user and first position information of eye key points of the user in the first eye region image;
obtaining a first gaze feature based on the first eye region image, wherein the first gaze feature is used for indicating gaze-related two-dimensional features of the eyes of the user;
obtaining a second gaze feature from the depth image based on the first position information, wherein the second gaze feature is used for indicating three-dimensional features of the eye key points of the user;
acquiring a first human eye pose parameter according to the first gaze feature;
acquiring a second human eye pose parameter according to the second gaze feature; the first human eye pose parameter and the second human eye pose parameter comprise sight line orientation information of the eyes of the user, and the sight line orientation information of the eyes of the user comprises a rotation angle, a pitch angle and a course angle of the eyes of the user under a preset three-dimensional coordinate system; the preset three-dimensional coordinate system takes the optical center of the preset camera as an origin;
and determining the gaze information of the user on the display screen based on the first human eye pose parameter and the second human eye pose parameter.
10. An electronic device comprising a memory and one or more processors; the memory is used for storing code instructions; the processor is configured to execute the code instructions to cause the electronic device to perform the method of any of claims 1-9.
11. A computer readable storage medium comprising computer instructions which, when run on an electronic device, cause the electronic device to perform the method of any of claims 1-9.
CN202211531249.6A 2022-12-01 2022-12-01 Gaze point estimation method, electronic device and computer readable storage medium Active CN116704589B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211531249.6A CN116704589B (en) 2022-12-01 2022-12-01 Gaze point estimation method, electronic device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211531249.6A CN116704589B (en) 2022-12-01 2022-12-01 Gaze point estimation method, electronic device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN116704589A CN116704589A (en) 2023-09-05
CN116704589B true CN116704589B (en) 2024-06-11

Family

ID=87836280

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211531249.6A Active CN116704589B (en) 2022-12-01 2022-12-01 Gaze point estimation method, electronic device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN116704589B (en)


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11113842B2 (en) * 2018-12-24 2021-09-07 Samsung Electronics Co., Ltd. Method and apparatus with gaze estimation
WO2020209491A1 (en) * 2019-04-11 2020-10-15 Samsung Electronics Co., Ltd. Head-mounted display device and operating method of the same

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569750A (en) * 2014-05-19 2019-12-13 微软技术许可有限责任公司 method and computing device for sight line detection calibration
CN106056092A (en) * 2016-06-08 2016-10-26 华南理工大学 Gaze estimation method for head-mounted device based on iris and pupil
CN109145864A (en) * 2018-09-07 2019-01-04 百度在线网络技术(北京)有限公司 Determine method, apparatus, storage medium and the terminal device of visibility region
CN109271914A (en) * 2018-09-07 2019-01-25 百度在线网络技术(北京)有限公司 Detect method, apparatus, storage medium and the terminal device of sight drop point
CN110568930A (en) * 2019-09-10 2019-12-13 Oppo广东移动通信有限公司 Method for calibrating fixation point and related equipment
CN113723144A (en) * 2020-05-26 2021-11-30 华为技术有限公司 Face watching unlocking method and electronic equipment
CN112308932A (en) * 2020-11-04 2021-02-02 中国科学院上海微***与信息技术研究所 Gaze detection method, device, equipment and storage medium
CN112257696A (en) * 2020-12-23 2021-01-22 北京万里红科技股份有限公司 Sight estimation method and computing equipment
CN115049819A (en) * 2021-02-26 2022-09-13 华为技术有限公司 Watching region identification method and device
CN113661495A (en) * 2021-06-28 2021-11-16 华为技术有限公司 Sight line calibration method, sight line calibration device, sight line calibration equipment, sight line calibration system and sight line calibration vehicle
CN113903078A (en) * 2021-10-29 2022-01-07 Oppo广东移动通信有限公司 Human eye gaze detection method, control method and related equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A 3D Point-of-Intention Estimation Method Using Multimodal Fusion of Hand Pointing, Eye Gaze and Depth Sensing for Collaborative Robots; Suparat Yeamkuan et al.; IEEE Sensors Journal; Vol. 22, No. 3; 2700-2710 *
Driver gaze zone dataset with depth data; Rafael F. Ribeiro et al.; 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019); 1-5 *
Research progress and prospects of eye tracking; Gou Chao et al.; Acta Automatica Sinica; Vol. 48, No. 5; 1173-1192 *

Also Published As

Publication number Publication date
CN116704589A (en) 2023-09-05

Similar Documents

Publication Publication Date Title
CN112717370B (en) Control method and electronic equipment
CN115866121B (en) Application interface interaction method, electronic device and computer readable storage medium
CN112333380A (en) Shooting method and equipment
CN108259758B (en) Image processing method, image processing apparatus, storage medium, and electronic device
CN112740152B (en) Handwriting pen detection method, handwriting pen detection system and related device
CN114466128B (en) Target user focus tracking shooting method, electronic equipment and storage medium
US20240005695A1 (en) Fingerprint Recognition Method and Electronic Device
CN111543049B (en) Photographing method and electronic equipment
CN111768352B (en) Image processing method and device
CN111566693A (en) Wrinkle detection method and electronic equipment
CN112334860A (en) Touch method of wearable device, wearable device and system
CN114422686B (en) Parameter adjustment method and related device
CN115914461B (en) Position relation identification method and electronic equipment
CN113325948B (en) Air-isolated gesture adjusting method and terminal
CN116704589B (en) Gaze point estimation method, electronic device and computer readable storage medium
CN111417982B (en) Color spot detection method and electronic equipment
CN111557007B (en) Method for detecting opening and closing states of eyes and electronic equipment
CN113391775A (en) Man-machine interaction method and equipment
CN111460942B (en) Proximity detection method and device, computer readable medium and terminal equipment
CN108833794B (en) Shooting method and mobile terminal
CN115393676A (en) Gesture control optimization method and device, terminal and storage medium
CN114445522A (en) Brush effect graph generation method, image editing method, device and storage medium
CN116437034A (en) Video special effect adding method and device and terminal equipment
CN111898488A (en) Video image identification method and device, terminal and storage medium
CN114510192B (en) Image processing method and related device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant