CN117711040A - Calibration method and electronic equipment

Publication number: CN117711040A
Application number: CN202310595864.1A
Authority: CN (China)
Inventor: 孙贻宝
Assignee: Honor Device Co Ltd
Legal status: Pending
Abstract

The invention provides a calibration method and electronic equipment, which are applied to the technical field of terminals and can output a gaze point estimation model with higher recognition accuracy. The method provided by the present disclosure includes: acquiring a sample data set; respectively extracting features of a first face image and a plurality of second face images based on an initial feature extraction network in an initial gaze point estimation model to obtain a first face image feature and a plurality of second face image features; performing regression processing on the first face image feature, the plurality of second face image features and a plurality of second gaze points based on an initial regression network in the initial gaze point estimation model to obtain a predicted gaze point corresponding to the first face image; calculating a first loss value between the predicted gaze point and the first gaze point, and a second loss value between the first face image feature and the plurality of second face image features; and iteratively updating the initial gaze point estimation model according to the first loss value and the second loss value to obtain a trained gaze point estimation model.

Description

Calibration method and electronic equipment
Technical Field
The disclosure relates to the technical field of terminals, and in particular relates to a calibration method and electronic equipment.
Background
Eye tracking technology uses a gaze point estimation model to determine where the user is looking and for how long. Before the gaze point estimation model is actually used, it needs to be calibrated so as to improve its recognition accuracy.
In the related art, the initial gaze point estimation model is calibrated based on a related calibration technique (for example, a linear correction calibration method) to obtain the gaze point estimation model, and gaze point estimation is then performed on a predicted image by using the gaze point estimation model to obtain a predicted gaze point. Comparing the predicted gaze point with the real gaze point corresponding to the predicted image shows that the error between the two exceeds the specified error; that is, the calibration effect of the gaze point estimation model calibrated by the related technique is poor. Therefore, how to improve the calibration effect of the gaze point estimation model is a problem to be solved.
Disclosure of Invention
The embodiment of the disclosure provides a calibration method and electronic equipment, which can improve the calibration effect of a gaze point estimation model, reduce the error between a predicted gaze point and a real gaze point, and further improve the recognition accuracy of the gaze point estimation model.
In order to achieve the above object, the embodiments of the present disclosure adopt the following technical solutions:
In a first aspect, the present disclosure provides a calibration method, the method comprising: a training device acquires a sample data set, wherein the sample data set comprises a first face image, a first gaze point corresponding to the first face image, a plurality of second face images and a second gaze point corresponding to each second face image; the first face image and the second face images are face images of the same user; then, the training device respectively performs feature extraction on the first face image and the plurality of second face images based on an initial feature extraction network in an initial gaze point estimation model to obtain a first face image feature and a plurality of second face image features; then performs regression processing on the first face image feature, the plurality of second face image features and the plurality of second gaze points based on an initial regression network in the initial gaze point estimation model to obtain a predicted gaze point corresponding to the first face image; then calculates a first loss value between the predicted gaze point and the first gaze point, and a second loss value between the first face image feature and the plurality of second face image features; and finally, iteratively updates the initial gaze point estimation model according to the first loss value and the second loss value to obtain the trained gaze point estimation model.
Based on the calibration method of the first aspect, after the training device acquires the first face image feature corresponding to the first face image and the plurality of second face image features corresponding to the plurality of second face images by using an initial feature extraction network in the initial gaze point estimation model, regression processing is performed on the first face image feature, the plurality of second face image features and the plurality of second gaze points by using an initial regression network. That is, the initial regression network introduces a plurality of second gaze points (i.e., real gaze points) when performing the regression process, which may be used to correct the regression direction of the initial regression network, so that the predicted gaze point output by the initial regression network is more accurate. And finally, using the first gaze point (namely the real gaze point) as supervision information, and iteratively updating the initial gaze point estimation model so as to enable the calibration effect of the trained gaze point estimation model to be better.
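As a rough illustration of this data flow, the following is a minimal PyTorch-style sketch of one training iteration; the module names, tensor shapes, batch keys and the stand-in losses are assumptions made for the sketch, not taken from the filing.

```python
import torch
import torch.nn.functional as F

def train_step(model, batch, optimizer):
    """One iteration over a sample-data-set entry: a first (query) face image
    with its first gaze point, plus several second (calibration) face images
    with their second gaze points, all from the same user."""
    q_img, q_gaze = batch["first_image"], batch["first_gaze"]        # (1, C, H, W), (1, 2)
    c_imgs, c_gazes = batch["second_images"], batch["second_gazes"]  # (K, C, H, W), (K, 2)

    # Initial feature extraction network, shared between both image groups.
    q_feat = model.feature_extractor(q_img)      # first face image feature, (1, D)
    c_feats = model.feature_extractor(c_imgs)    # second face image features, (K, D)

    # Initial regression network also consumes the real second gaze points,
    # which steer the regression towards the correct region of the screen.
    pred_gaze = model.regressor(q_feat, c_feats.unsqueeze(0), c_gazes.unsqueeze(0))

    # Stand-in losses: the disclosure's concrete loss relations (a
    # distance-plus-angle gaze loss and a contrastive feature loss) are
    # described later in the text.
    loss_g = F.mse_loss(pred_gaze, q_gaze)                      # first loss value
    loss_x = F.mse_loss(q_feat, c_feats.mean(0, keepdim=True))  # second loss value

    optimizer.zero_grad()
    (loss_g + loss_x).backward()   # one simple way to use both losses jointly
    optimizer.step()
    return loss_g.item(), loss_x.item()
```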
With reference to the first aspect, in another possible implementation manner, performing feature extraction on the first face image and the plurality of second face images based on an initial feature extraction network in the initial gaze point estimation model to obtain a first face image feature and a plurality of second face image features includes: the training device pre-processes the first face image to obtain first identification data corresponding to the first face image, wherein the first identification data comprises a left eye image of the first face, a right eye image of the first face, a first face area image and a first face grid image; and pre-processes the plurality of second face images to obtain second identification data corresponding to each second face image in the plurality of second face images, wherein the second identification data comprises a left eye image of the second face, a right eye image of the second face, a second face area image and a second face grid image. The training device performs feature extraction on the first identification data by using the initial feature extraction network to obtain the first face image feature, and performs feature extraction on the plurality of second identification data by using the initial feature extraction network to obtain the plurality of second face image features.
Based on this possible implementation, the training device performs a preprocessing operation on the first face image and the plurality of second face images before extracting the first face image feature and the plurality of second face image features using the initial feature extraction network. The first identification data and the plurality of second identification data obtained through the preprocessing operation include only features related to the face, such as the left eye, the right eye, the face area and the face grid. Therefore, the features extracted by the subsequent initial feature extraction network are all features of the human face, extraction of irrelevant features in the first face image and the plurality of second face images is avoided, and the accuracy of the feature extraction result is improved.
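As an illustration only, a minimal sketch of such a preprocessing step is given below, assuming the eye and face bounding boxes come from an external face/landmark detector and that the face grid image is a coarse binary occupancy mask; both assumptions are illustrative and not taken from the filing.

```python
import numpy as np

def crop(img, box):
    """box = (x0, y0, x1, y1) in pixels."""
    x0, y0, x1, y1 = box
    return img[y0:y1, x0:x1]

def face_grid(face_box, image_shape, grid_size=25):
    """Coarse binary mask marking the grid cells covered by the face box."""
    h, w = image_shape[:2]
    x0, y0, x1, y1 = face_box
    grid = np.zeros((grid_size, grid_size), dtype=np.float32)
    gx0, gx1 = int(x0 / w * grid_size), int(np.ceil(x1 / w * grid_size))
    gy0, gy1 = int(y0 / h * grid_size), int(np.ceil(y1 / h * grid_size))
    grid[gy0:gy1, gx0:gx1] = 1.0
    return grid

def preprocess(face_image, left_eye_box, right_eye_box, face_box):
    """Produce the four kinds of identification data for one face image."""
    return (crop(face_image, left_eye_box),         # left eye image
            crop(face_image, right_eye_box),        # right eye image
            crop(face_image, face_box),             # face area image
            face_grid(face_box, face_image.shape))  # face grid image
```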
With reference to the first aspect, in another possible implementation manner, performing regression processing on the first face image feature, the plurality of second face image features, and the plurality of second gaze points based on an initial regression network in the initial gaze point estimation model to obtain a predicted gaze point corresponding to the first face image includes: splicing the first face image feature, the plurality of second face image features and the plurality of second gaze points to obtain spliced data; and performing regression processing on the spliced data based on the initial regression network to obtain the predicted gaze point.
Based on this possible implementation, the training device introduces not only the first face image feature, the plurality of second face image features, but also the plurality of second gaze points when calculating the predicted gaze point using the initial regression network. Since the plurality of second gaze points are real gaze points corresponding to the plurality of second face images, the initial regression network can predict in a direction closer to the plurality of second gaze points during prediction, thereby outputting a predicted gaze point with higher accuracy.
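For concreteness, one possible sketch of such a regression network over the spliced data follows; the layer sizes, the flattening order and the assumption of 2-D gaze coordinates are illustrative, not taken from the filing.

```python
import torch
import torch.nn as nn

class GazeRegressor(nn.Module):
    """Regression head operating on the spliced (concatenated) data."""

    def __init__(self, feat_dim, num_calib):
        super().__init__()
        # Query feature + K calibration features + K 2-D calibration gaze points.
        in_dim = feat_dim + num_calib * (feat_dim + 2)
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, 2),        # predicted 2-D gaze point
        )

    def forward(self, q_feat, c_feats, c_gazes):
        # q_feat: (B, D); c_feats: (B, K, D); c_gazes: (B, K, 2)
        spliced = torch.cat([q_feat, c_feats.flatten(1), c_gazes.flatten(1)], dim=1)
        return self.mlp(spliced)
```

For instance, GazeRegressor(feat_dim=128, num_calib=9) would map one query feature plus nine calibration features and their gaze points to a single predicted gaze point.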
With reference to the first aspect, in another possible implementation manner, iteratively updating the initial gaze point estimation model according to the first loss value and the second loss value to obtain a trained gaze point estimation model includes: iteratively updating the initial regression network according to the first loss value to obtain a trained regression network; iteratively updating the initial feature extraction network according to the second loss value to obtain a trained feature extraction network; the trained gaze point estimation model comprises a trained regression network and a trained feature extraction network.
Based on this possible implementation manner, the training device iteratively updates the initial regression network through the first loss value between the predicted gaze point output by the initial regression network and the first gaze point (the real gaze point), so that the predicted gaze point output by the trained regression network is closer to the real gaze point. The initial feature extraction network is iteratively updated through the second loss value between the plurality of second face image features and the first face image feature output by the initial feature extraction network, so that the image features output by the trained feature extraction network are closer to the real image features.
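One way to realize these separate updates is sketched below, assuming the model exposes regressor and feature_extractor sub-modules, each with its own optimizer (for example a torch.optim.Adam instance over the respective parameter group); these are assumptions for illustration, not the filing's implementation.

```python
import torch

def update(model, loss_g, loss_x, reg_opt, feat_opt):
    """Route the first loss to the regression network and the second loss
    to the feature extraction network."""
    reg_opt.zero_grad()
    feat_opt.zero_grad()
    # Accumulate gradients only on the parameters each loss is meant to update.
    loss_g.backward(retain_graph=True, inputs=list(model.regressor.parameters()))
    loss_x.backward(inputs=list(model.feature_extractor.parameters()))
    reg_opt.step()
    feat_opt.step()
```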
With reference to the first aspect, in another possible implementation manner, the first loss value satisfies a relationship in which Lg denotes the first loss value, G_q denotes the predicted gaze point, and Ĝ_q denotes the first gaze point.
From the first term of this relationship, the calculation of the first loss value takes into account the distance between the predicted gaze point and the first gaze point; from the second term, the first loss value also takes into account the angle between the predicted gaze point and the first gaze point.
In this way, even when the distance between the predicted gaze point and the first gaze point is small but the angle between them is large, the deviation between the predicted gaze point and the first gaze point can still be constrained by the loss on the angle. The first loss value determined in this way imposes a stronger constraint than using the distance between the predicted gaze point and the first gaze point alone. Iteratively updating the initial gaze point estimation model with this relationship makes the recognition accuracy of the trained gaze point estimation model higher, and the predicted gaze point output by the trained gaze point estimation model closer to the real gaze point.
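A hedged LaTeX sketch of a loss with this distance-plus-angle structure, in which the particular functional form and the balancing weight λ are assumptions rather than the expression from the filing:

L_g = \lVert G_q - \hat{G}_q \rVert_2 + \lambda \left( 1 - \frac{G_q \cdot \hat{G}_q}{\lVert G_q \rVert_2 \, \lVert \hat{G}_q \rVert_2} \right)

Here the first term penalizes the Euclidean distance between the predicted gaze point G_q and the first gaze point Ĝ_q, and the second term penalizes the angle between them through their cosine similarity.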
With reference to the first aspect, in another possible implementation manner, the second loss value satisfies a relationship in which Lx denotes the second loss value, Xc denotes a second face image feature, Xq denotes the first face image feature, m = 1, s = 1 indicates that the distance between the first gaze point and the second gaze point is smaller than or equal to a preset distance, and s = 0 indicates that the distance between the first gaze point and the second gaze point is larger than the preset distance. The initial feature extraction network is then iteratively trained by using this contrastive loss value, so that the feature extraction effect of the trained feature extraction network is better.
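A hedged LaTeX sketch of a standard contrastive loss that is consistent with the margin m and indicator s described above (this form is an assumption, not the expression from the filing):

L_x = \frac{1}{K} \sum_{c=1}^{K} \Bigl[ s_c \, \lVert X_c - X_q \rVert_2^{2} + (1 - s_c) \, \max\!\bigl(0,\; m - \lVert X_c - X_q \rVert_2 \bigr)^{2} \Bigr], \qquad m = 1

With s_c = 1 the features of face images whose gaze points lie within the preset distance are pulled together, and with s_c = 0 the features of face images whose gaze points are farther apart than the preset distance are pushed apart by at least the margin m.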
With reference to the first aspect, in another possible implementation manner, the preset distance is 2cm.
With reference to the first aspect, in another possible implementation manner, the method further includes: the training device sends the trained gaze point estimation model to the terminal device, so that the terminal device outputs the estimated gaze point corresponding to the to-be-processed sight line interaction image by applying the trained gaze point estimation model.
Based on this possible implementation, when the training device generates a trained gaze point estimation model, the trained gaze point estimation model may be sent to the terminal device. And then, when the terminal equipment realizes the sight line interaction function, the initial gaze point estimation model does not need to be trained, so that the data processing efficiency of the terminal equipment is higher.
In a second aspect, an embodiment of the present disclosure provides a calibration device, which may be applied to an electronic apparatus, for implementing the method in the first aspect. The function of the calibration device can be realized by hardware, and can also be realized by executing corresponding software by hardware. The hardware or software includes one or more modules corresponding to the above functions, for example, an acquisition module, a feature extraction module, a processing module, a determination module, an update module, and the like.
The acquisition module is configured to acquire a sample data set, wherein the sample data set comprises a first face image, a first gaze point corresponding to the first face image, a plurality of second face images and a second gaze point corresponding to each second face image; the first face image and the second face images are face images of the same user. The feature extraction module is configured to respectively perform feature extraction on the first face image and the plurality of second face images based on an initial feature extraction network in the initial gaze point estimation model to obtain a first face image feature and a plurality of second face image features. The processing module is configured to perform regression processing on the first face image feature, the plurality of second face image features and the plurality of second gaze points based on an initial regression network in the initial gaze point estimation model to obtain a predicted gaze point corresponding to the first face image. The determination module is configured to calculate a first loss value between the predicted gaze point and the first gaze point, and a second loss value between the first face image feature and the plurality of second face image features. The updating module is configured to iteratively update the initial gaze point estimation model according to the first loss value and the second loss value to obtain a trained gaze point estimation model.
With reference to the second aspect, in one possible implementation manner, the feature extraction module is further configured to pre-process the first face image to obtain first identification data corresponding to the first face image, where the first identification data includes a left eye image of the first face, a right eye image of the first face, a first face region image, and a first face grid image; pre-process the plurality of second face images to obtain second identification data corresponding to each second face image in the plurality of second face images, where the second identification data includes a left eye image of the second face, a right eye image of the second face, a second face area image and a second face grid image; perform feature extraction on the first identification data by using the initial feature extraction network to obtain the first face image feature; and perform feature extraction on the plurality of second identification data by using the initial feature extraction network to obtain the plurality of second face image features.
With reference to the second aspect, in a possible implementation manner, the processing module is further configured to splice the first face image feature, the plurality of second face image features, and the plurality of second gaze points to obtain spliced data; and perform regression processing on the spliced data based on the initial regression network to obtain the predicted gaze point.
With reference to the second aspect, in a possible implementation manner, the updating module is further configured to iteratively update the initial regression network according to the first loss value to obtain a trained regression network; iteratively updating the initial feature extraction network according to the second loss value to obtain a trained feature extraction network; the trained gaze point estimation model comprises a trained regression network and a trained feature extraction network.
With reference to the second aspect, in one possible implementation manner, the first loss value satisfies the same relationship as described for the first aspect, wherein Lg denotes the first loss value, G_q denotes the predicted gaze point, and Ĝ_q denotes the first gaze point.
With reference to the second aspect, in one possible implementation manner, the second loss value satisfies the same relationship as described for the first aspect, wherein Lx denotes the second loss value, Xc denotes a second face image feature, Xq denotes the first face image feature, m = 1, s = 1 indicates that the distance between the first gaze point and the second gaze point is smaller than or equal to a preset distance, and s = 0 indicates that the distance between the first gaze point and the second gaze point is larger than the preset distance.
With reference to the second aspect, in a possible implementation manner, the preset distance is 2cm.
With reference to the second aspect, in a possible implementation manner, the calibration device may further include a sending module. And the sending module is configured to send the trained gaze point estimation model to the terminal equipment so that the terminal equipment can apply the trained gaze point estimation model to output the estimated gaze point corresponding to the sight line interaction image to be processed.
In a third aspect, the present disclosure provides an electronic device comprising: a memory, a display screen, and one or more processors; the memory, display screen and processor are coupled. Wherein the memory is for storing computer program code, the computer program code comprising computer instructions; the processor is configured to execute one or more computer instructions stored by the memory when the electronic device is operating, to cause the electronic device to perform the calibration method as described in any one of the first aspects above.
In a fourth aspect, the present disclosure provides a computer storage medium comprising computer instructions which, when run on an electronic device, cause the electronic device to perform the calibration method according to any one of the first aspects.
In a fifth aspect, the present disclosure provides a computer program product for, when run on an electronic device, causing the electronic device to perform the calibration method as in any one of the first aspects.
In a sixth aspect, there is provided an apparatus (for example, the apparatus may be a system-on-a-chip) comprising a processor, configured to support a first device in implementing the functionality referred to in the first aspect above. In one possible design, the apparatus further includes a memory for holding the program instructions and data necessary for the first device. When the apparatus is a chip system, it may consist of a chip, or may include the chip and other discrete devices.
It should be appreciated that the advantages of the second to sixth aspects may be referred to in the description of the first aspect, and are not described herein.
Drawings
Fig. 1 is a schematic structural diagram of a calibration system according to an embodiment of the disclosure.
Fig. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Fig. 3 is a schematic hardware structure of a terminal device according to an embodiment of the present disclosure.
Fig. 4 is a schematic software structure of a terminal device according to an embodiment of the present disclosure.
Fig. 5 is a schematic flow chart of a calibration method according to an embodiment of the disclosure.
Fig. 6 is a schematic diagram of a result of preprocessing provided in an embodiment of the present disclosure.
Fig. 7 is a schematic view of a scenario provided in an embodiment of the present disclosure.
Fig. 8 is a schematic display diagram of an application scenario of a calibration method according to an embodiment of the present disclosure.
Fig. 9 is a second schematic flow chart of a calibration method according to an embodiment of the disclosure.
Fig. 10 is a schematic structural diagram of a calibration device according to an embodiment of the disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described below with reference to the drawings in the embodiments of the present disclosure. In the description of the present disclosure, unless otherwise specified, "/" indicates an "or" relationship between the associated objects; for example, A/B may represent A or B. "And/or" in the present disclosure merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate the three cases of A alone, both A and B, and B alone, where A and B may be singular or plural. In the description of the present disclosure, unless otherwise indicated, "a plurality of" means two or more. "At least one of" the following items or similar expressions refers to any combination of these items, including any combination of single items or plural items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b and c may each be single or plural. In addition, in order to clearly describe the technical solutions of the embodiments of the present disclosure, words such as "first" and "second" are used in the embodiments of the present disclosure to distinguish between identical or similar items having substantially the same functions and effects. Those skilled in the art will appreciate that the words "first", "second" and the like do not limit the quantity or order of execution and do not indicate a necessary difference. Meanwhile, in the embodiments of the present disclosure, words such as "exemplary" or "for example" are used to indicate an example, an illustration or a description. Any embodiment or design described as "exemplary" or "for example" in the embodiments of the present disclosure should not be construed as being preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "for example" is intended to present related concepts in a concrete manner that is easy to understand.
In addition, the network architecture and the service scenario described in the embodiments of the present disclosure are for more clearly describing the technical solution of the embodiments of the present disclosure, and do not constitute a limitation on the technical solution provided by the embodiments of the present disclosure, and as a person of ordinary skill in the art can know, with evolution of the network architecture and appearance of a new service scenario, the technical solution provided by the embodiments of the present disclosure is equally applicable to similar technical problems.
With the development of terminal technology and interaction technology, users gradually abandon traditional interaction modes (for example, a mouse and a keyboard are used as input control electronic equipment to execute corresponding instructions, or a touch display screen is used as input control electronic equipment to execute corresponding instructions, etc.), and start to control the electronic equipment to execute corresponding instructions according to the emerging interaction modes. For example, the electronic device is controlled based on line-of-sight interactions, voice interactions, or gesture interactions, among others. The electronic equipment is controlled based on line-of-sight interaction, specifically, the electronic equipment recognizes the eyeball movement of the user, and corresponding user intention instructions are obtained according to the eyeball movement of the user, so that the user intention instructions are executed.
In general, electronic devices may recognize eye movements of a user using eye movement tracking techniques. Eye tracking technology is primarily a process that identifies what the user is looking at and how. Eye tracking technology is now widely used in many fields such as human-computer interaction, virtual reality, vehicle assisted driving, human factor analysis and psychological research.
Eye tracking techniques can be implemented in a variety of ways, for example, geometric methods and artificial intelligence (Artificial Intelligence, AI) methods. Taking the AI method as an example, the eye tracking technology can identify the gaze point of the user (i.e. obtain what the user is looking at) through a gaze point estimation model. In general, before the gaze point estimation model is applied, the gaze point estimation model needs to be calibrated to improve the recognition accuracy of the gaze point estimation model. Currently, there are various calibration methods for the gaze point estimation model, for example, a linear correction method, a fine-tuning method, and the like.
The linear correction method and the fine-tuning method are described in detail below.
The implementation process of the linear correction method is as follows: firstly, a line-of-sight interaction image and a real gaze point G′ corresponding to the line-of-sight interaction image are acquired. Then, gaze point identification is performed on the line-of-sight interaction image by using the initial gaze point estimation model to obtain a predicted gaze point G. Finally, a linear relationship between the predicted gaze point G and the real gaze point G′ is generated based on the two. The linear relationship satisfies the following linear expression:
G′ = AG + B
where A and B may be multidimensional matrices and can be obtained by the least squares method.
After the linear relationship between the predicted gaze point G and the real gaze point G′ is determined, other predicted gaze points output by the initial gaze point estimation model can be corrected based on the parameters A and B in the linear expression, so that the final gaze points are obtained from those other predicted gaze points.
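A minimal numpy sketch of this linear correction, assuming 2-D gaze points on the screen; the function and variable names are illustrative.

```python
import numpy as np

def fit_linear_correction(predicted, real):
    """Fit G' = A @ G + B by least squares.

    predicted, real: arrays of shape (N, 2) holding predicted and real
    gaze points collected during calibration.
    """
    n = predicted.shape[0]
    # Augment with a constant column so the bias B is fitted jointly with A.
    X = np.hstack([predicted, np.ones((n, 1))])      # (N, 3)
    W, *_ = np.linalg.lstsq(X, real, rcond=None)     # (3, 2)
    A, B = W[:2].T, W[2]                             # A: (2, 2), B: (2,)
    return A, B

def apply_linear_correction(gaze, A, B):
    """Correct a batch (M, 2) of predicted gaze points."""
    return gaze @ A.T + B
```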
After the initial gaze point estimation model is calibrated by this method, the final gaze point obtained in this way is found to differ considerably from the real gaze point; that is, the recognition accuracy of the calibrated gaze point estimation model cannot meet the user requirement.
When a user triggers the sight line interaction function of the electronic equipment, the gaze point estimation model in the electronic equipment can be calibrated through a fine tuning method. The implementation process of the fine tuning method is as follows: first, in response to a user's gaze interaction, the electronic device needs to train an initial gaze point estimation model (i.e., adjust parameters in the gaze point estimation model) to obtain a trained gaze point estimation model. And then, the gaze point corresponding to the sight line interaction operation is identified by using the trained gaze point estimation model. And finally triggering the electronic equipment to execute corresponding functional instructions according to the gaze point.
Therefore, if the fine-tuning method is used, after the user triggers the line-of-sight interaction function of the electronic device, the user needs to wait for a certain period of time so that the electronic device can generate the trained gaze point estimation model. In this way, the electronic device cannot respond to the user demand in real time to execute the corresponding function instruction, which reduces the use experience of the user to a certain extent. Therefore, there is a need for a better calibration method to calibrate the gaze point estimation model.
Therefore, the embodiment of the disclosure provides a calibration method for calibrating the gaze point estimation model, and the obtained trained gaze point estimation model has higher recognition accuracy. In addition, with this calibration method the gaze point estimation model can be trained in advance, so the user does not need to wait when the model is applied; the real-time use requirement of the user can be met, and user satisfaction is improved.
Fig. 1 is a schematic structural diagram of a calibration system according to an embodiment of the disclosure. As shown in fig. 1, the calibration system comprises a training device 101 and a terminal device 102.
The training device 101 may be connected to the terminal device 102 by a wireless communication technology or a wired communication technology. In addition, fig. 1 is an illustration taking a direct connection between the training device 101 and the terminal device 102 as an example, in an actual implementation, a node device such as an edge server, a router, a base station, or a gateway may be disposed between the training device 101 and the terminal device 102, which may be determined according to an actual use requirement, and embodiments of the present disclosure are not limited.
The training device 101 according to the embodiments of the present disclosure may be a general-purpose device or a special-purpose device. For example, the training device 101 may be a desktop computer, a laptop, a palmtop computer (personal digital assistant, PDA), a mobile handset, a tablet, a wireless terminal device, an embedded device, a terminal device, or a device having a structure similar to that in fig. 2. The training device 101 may also be an independently running server, a distributed server, or a server cluster consisting of a plurality of servers. The training device 101 may be a device for training the gaze point estimation model. The training device 101 trains the initial gaze point estimation model into a trained gaze point estimation model that can meet the needs of the user. In this way, the terminal device 102 may be connected to the training device 101 through a network to obtain the trained gaze point estimation model, so that the terminal device 102 can respond to the line-of-sight interaction of the user and implement the corresponding function instruction. Embodiments of the present disclosure are not limited to the particular technology and particular device form employed by the training device 101.
The terminal device 102 according to the embodiments of the present disclosure may be a user device, a mobile device, a user terminal, a wireless communication device, a user agent, or a user apparatus, and may also be a smart phone, a tablet computer, a notebook computer, a wearable device, a personal computer (personal computer, PC), a vehicle-mounted device, a netbook, or a personal digital assistant (personal digital assistant, PDA), which is not particularly limited in the embodiments of the present disclosure. It should be noted that the terminal device 102 in the embodiments of the present disclosure may have a photographing function, an algorithm processing function, a communication function, and the like. The embodiments of the present disclosure are not limited to the specific type and structure of the terminal device 102.
It will be appreciated that the training device 101 and the terminal device 102 may be two separate devices or may be the same device. The present disclosure is not limited in this regard.
The system architecture described in the embodiments of the present disclosure is for more clearly describing the technical solution of the embodiments of the present disclosure, and does not constitute a limitation to the technical solution provided by the embodiments of the present disclosure, and as a person of ordinary skill in the art can know that, with evolution of the network architecture and occurrence of a new service scenario, the technical solution provided by the embodiments of the present disclosure is equally applicable to similar technical problems.
Alternatively, the training device 101 in the embodiment of the present disclosure may adopt the constituent structure shown in fig. 2 or include the components shown in fig. 2. Fig. 2 is a schematic structural diagram of an electronic device 20 according to an embodiment of the present disclosure. As shown in fig. 2, the electronic device 20 includes one or more processors 201, a communication line 202, and at least one communication interface (fig. 2 merely takes one communication interface 203 and one processor 201 as an example), and optionally includes a memory 204.
The processor 201 may be a general purpose central processing unit (central processing unit, CPU), microprocessor, application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of programs in accordance with the present disclosure.
The communication line 202 may include a pathway for communication between the different components.
The communication interface 203, which may be a transceiver module, is used to communicate with other devices or communication networks, such as ethernet, RAN, wireless local area network (wireless local area networks, WLAN), etc. For example, the transceiver module may be a device such as a transceiver or a transceiver. Alternatively, the communication interface 203 may be a transceiver circuit located in the processor 201, so as to implement signal input and signal output of the processor.
The memory 204 may be a device having a storage function. For example, it may be, but is not limited to, a read-only memory (read-only memory, ROM) or another type of static storage device that can store static information and instructions, a random access memory (random access memory, RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (electrically erasable programmable read-only memory, EEPROM), a compact disc read-only memory (compact disc read-only memory, CD-ROM) or other optical disc storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory may exist independently and be coupled to the processor via the communication line 202. The memory may also be integrated with the processor.
The memory 204 is configured to store the computer-executable instructions for executing the solutions of the present disclosure, and execution is controlled by the processor 201. The processor 201 is configured to execute the computer-executable instructions stored in the memory 204, so as to implement the calibration method provided in the embodiments of the present disclosure.
Alternatively, in the embodiment of the present disclosure, the processor 201 may perform the functions related to the processing in the calibration method provided in the embodiment of the present disclosure, where the communication interface 203 is responsible for communicating with other devices or communication networks, and the embodiment of the present disclosure is not limited in detail.
Alternatively, computer-executable instructions in embodiments of the present disclosure may also be referred to as application code, which embodiments of the present disclosure are not particularly limited.
In a particular implementation, as one embodiment, processor 201 may include one or more CPUs, such as CPU0 and CPU1 of FIG. 2.
In a particular implementation, as one embodiment, the electronic device 20 may include multiple processors, such as the processor 201 and the processor 207 in FIG. 2. Each of these processors may be a single-core processor or a multi-core processor. The processor herein may include, but is not limited to, at least one of: a central processing unit (central processing unit, CPU), microprocessor, digital Signal Processor (DSP), microcontroller (microcontroller unit, MCU), or artificial intelligence processor, each of which may include one or more cores for executing software instructions to perform operations or processes.
In a particular implementation, electronic device 20 may also include an output device 205 and an input device 206, as one embodiment. The output device 205 communicates with the processor 201 and may display information in a variety of ways. For example, the output device 205 may be a liquid crystal display (liquid crystal display, LCD), a light emitting diode (light emitting diode, LED) display device, a Cathode Ray Tube (CRT) display device, or a projector (projector), or the like. The input device 206 is in communication with the processor 201 and may receive user input in a variety of ways. For example, the input device 206 may be a mouse, a keyboard, a touch screen device, a sensing device, or the like.
The electronic device 20 described above may also sometimes be referred to as a communication device, which may be a general purpose device or a special purpose device. For example, the electronic device 20 may be a desktop, a portable computer, a web server, a palm top (personal digital assistant, PDA), a mobile handset, a tablet, a wireless terminal device, an embedded device, a terminal device as described above, a training device as described above, or a device having a similar structure as in fig. 2. The disclosed embodiments are not limited in the type of electronic device 20.
Alternatively, fig. 3 shows a schematic hardware structure of the terminal device 102. As shown in fig. 3, the terminal device may include: processor 310, external memory interface 320, internal memory 331, universal serial bus (universal serial bus, USB) interface 330, charge management module 340, power management module 341, battery 342, antenna 1, antenna 2, mobile communication module 350, wireless communication module 360, audio module 370, speaker 370A, receiver 370B, microphone 370C, headset interface 330D, sensor module 380, keys 390, motor 391, indicator 392, camera 393, display 394, and subscriber identity module (subscriber identification module, SIM) card interface 395, among others. The sensor module 380 may include a pressure sensor 380A, a gyroscope sensor 380B, an air pressure sensor 380C, a magnetic sensor 380D, an acceleration sensor 380E, a distance sensor 380F, a proximity sensor 380G, a fingerprint sensor 380H, a temperature sensor 380J, a touch sensor 380K, an ambient light sensor 380L, a bone conduction sensor 380M, and the like.
It will be appreciated that the structure illustrated in this embodiment does not constitute a specific limitation on the terminal device. In other embodiments, the terminal device may include more or less components than illustrated, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 310 may include one or more processing units, such as: the processor 310 may include an application processor (application processor, AP), a Modem, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), a controller, a video codec, a digital signal processor (digital signal processor, DSP), a baseband processor, and/or a neural network processor (neural-network processing unit, NPU), etc. Wherein the different processing units may be separate devices or may be integrated in one or more processors.
The charge management module 340 is configured to receive a charge input from a charger. The charger can be a wireless charger or a wired charger.
The power management module 341 is configured to connect the battery 342, the charge management module 340 and the processor 310. The power management module 341 receives input from the battery 342 and/or the charge management module 340 to power the processor 310, the internal memory 331, the display screen 394, the camera 393, the wireless communication module 360, and the like.
The wireless communication function of the terminal device may be implemented by the antenna 1, the antenna 2, the mobile communication module 350, the wireless communication module 360, the modem, the baseband processor, and the like.
The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in the terminal device may be used to cover a single or multiple communication bands. Different antennas may also be multiplexed to improve the utilization of the antennas.
The mobile communication module 350 may provide a solution for wireless communication including 2G/3G/4G/5G or the like applied on a terminal device.
The wireless communication module 360 may provide solutions for wireless communication applied on the terminal device, including wireless local area network (wireless local area networks, WLAN) (for example, a wireless fidelity (wireless fidelity, Wi-Fi) network), Bluetooth (BT), global navigation satellite system (global navigation satellite system, GNSS), frequency modulation (FM), near field communication (near field communication, NFC), infrared (infrared, IR), and the like. The wireless communication module 360 may be one or more devices integrating at least one communication processing module. The wireless communication module 360 receives electromagnetic waves via the antenna 2, performs frequency modulation and filtering on the electromagnetic wave signals, and transmits the processed signals to the processor 310. The wireless communication module 360 may also receive a signal to be transmitted from the processor 310, frequency modulate it, amplify it, and convert it into electromagnetic waves for radiation via the antenna 2.
The terminal device implements display functions through the GPU, the display screen 394, the application processor, and the like. The GPU is a microprocessor for image processing, connected to the display screen 394 and the application processor.
The display screen 394 is used for displaying images, videos, and the like. A series of graphical user interfaces (graphical user interface, GUI) may be displayed on the display 394 of the terminal device.
The terminal device may implement shooting functions through the ISP, the camera 393, the video codec, the GPU, the display 394, the application processor, and the like.
Camera 393 is used to capture still images or video.
The external memory interface 320 may be used to connect an external memory card, such as a MicroSD card, to enable expansion of the memory capabilities of the terminal device.
The internal memory 331 may be used to store computer executable program code including instructions. The processor 310 executes various functional applications of the terminal device and data processing by executing instructions stored in the internal memory 331.
The terminal device may implement audio functions through an audio module 370, a speaker 370A, a receiver 370B, a microphone 370C, an earphone interface 330D, an application processor, and the like. Such as music playing, recording, etc. The terminal device may also include a pressure sensor 380A, a barometric pressure sensor 380C, a gyroscope sensor 380B, a magnetic sensor 380D, an acceleration sensor 380E, a distance sensor 380F, a proximity sensor 380G, an ambient light sensor 380L, a fingerprint sensor 380H, a temperature sensor 380J, a touch sensor 380K, a bone conduction sensor 380M, keys 390, a motor 391, an indicator 392, and the like.
The SIM card interface 395 is for interfacing with a SIM card. The SIM card may be contacted and separated from the terminal device by being inserted into the SIM card interface 395 or by being withdrawn from the SIM card interface 395. The terminal device may support 1 or N SIM card interfaces, N being a positive integer greater than 1. The SIM card interface 395 may support Nano SIM cards, micro SIM cards, and the like. The same SIM card interface 395 can be used to insert multiple cards simultaneously. The SIM card interface 395 may also be compatible with external memory cards. The terminal equipment interacts with the network through the SIM card to realize the functions of communication, data communication and the like.
Further, an operating system, such as the HarmonyOS (Hongmeng) operating system, the iOS operating system, the Android operating system or the Windows operating system, runs on the above components. Applications may be installed and run on the operating system. In other embodiments, there may be multiple operating systems running within the terminal device.
It should be understood that the hardware modules included in the terminal device shown in fig. 3 are only described by way of example, and are not limiting on the specific structure of the terminal device. In fact, the terminal device provided in the embodiments of the present disclosure may further include other hardware modules having an interaction relationship with the hardware modules illustrated in the drawings, which is not specifically limited herein. For example, the terminal device may also include a flash, a miniature projection device, etc. As another example, if the terminal device is a PC, the terminal device may further include a keyboard, a mouse, and the like.
The software system of the terminal device may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture or a cloud architecture. The embodiments of the invention take the layered-architecture Android system as an example to describe the software structure of a mobile phone.
The layered architecture divides the software into several layers, each with a clear role and division of labour. The layers communicate via interfaces. In some embodiments, the Android system is divided into four layers: from top to bottom, an application layer, an application framework layer, Android runtime and system libraries, and a kernel layer.
The application layer may include a series of application packages.
As shown in fig. 4, the application package may include applications such as mail, camera, calendar, phone call, map, navigation, WLAN, bluetooth, music, video, short message, etc.
The application framework layer provides an application programming interface (application programming interface, API) and programming framework for application programs of the application layer. The application framework layer includes a number of predefined functions.
The application framework layer may include an activity manager, a window manager, a content provider, a view system, a resource manager, a notification manager, etc., which embodiments of the present disclosure do not impose any limitations.
Activity Manager (Activity Manager): used for managing the lifecycle of each application. Applications typically run in the operating system in the form of an Activity. For each Activity, there is a corresponding application record (ActivityRecord) in the Activity Manager, which records the state of the application's Activity. The Activity Manager may schedule the application's Activity process using this ActivityRecord as an identifier.
Window Manager (WindowManagerService): used for managing the graphical user interface (graphical user interface, GUI) resources used on the screen, specifically: obtaining the screen size, creating and destroying windows, displaying and hiding windows, window layout, focus management, and input method and wallpaper management, etc.
The system libraries and kernel layer below the application framework layer may be referred to as an underlying system, which includes an underlying display system for providing display services; for example, the underlying display system includes the display driver in the kernel layer and the surface manager in the system libraries.
Android Runtime includes a core library and virtual machines. Android Runtime is responsible for scheduling and management of the Android system. The core library consists of two parts: one part is the functions that the Java language needs to call, and the other part is the core library of Android. The application layer and the application framework layer run in the virtual machine. The virtual machine executes the Java files of the application layer and the application framework layer as binary files. The virtual machine is used for performing functions such as object life cycle management, stack management, thread management, security and exception management, and garbage collection.
The system library may include a plurality of functional modules. For example: surface manager (surface manager), media library (Media Libraries), three-dimensional graphics processing library (e.g., openGL ES), two-dimensional image engine (e.g., SGL), algorithm library, etc.
The surface manager is used to manage the display subsystem and provides a fusion of 2D and 3D layers for multiple applications.
Media libraries support a variety of commonly used audio, video format playback and recording, still image files, and the like. The media library may support a variety of audio video encoding formats, such as: MPEG4, h.264, MP3, AAC, AMR, JPG, PNG, etc.
OpenGL ES is used to implement three-dimensional graphics drawing, image rendering, compositing, and layer processing, among others.
SGL is the drawing engine for 2D drawing.
Included in the algorithm library is a trained gaze point estimation model obtained from the training device 101.
The kernel layer is a layer between hardware and software. The inner core layer at least comprises a display driver, a camera driver, an audio driver and a sensor driver.
The calibration method provided by the embodiment of the present disclosure will be described below with reference to fig. 1 to 4. The device in the embodiments described below may have the components shown in fig. 2 or have the components shown in fig. 3. Wherein, the actions, terms and the like related to the embodiments of the present disclosure can be referred to each other without limitation. The message names of interactions between devices or parameter names in the messages in the embodiments of the present disclosure are just an example, and other names may be used in specific implementations without limitation.
The following describes the flow of the calibration method provided by the embodiments of the present disclosure, which may include: (1) a model training phase; (2) a model application phase.
Flow (1) (model training phase) is described below.
Step 501, the training device acquires a plurality of face images.
Currently, the gaze interaction functionality of most smart devices (e.g., smartphones, tablets, smart screens, head-mounted smart devices (e.g., AR/VR glasses), etc.) is implemented based on eye-tracking technology. Eye tracking techniques may be implemented by gaze point estimation models. Before the eye tracking technique is implemented using the gaze point estimation model, the initial gaze point estimation model needs to be trained using a large amount of training data, so as to obtain a gaze point estimation model that can be used.
When the eye tracking technology is applied to a smart device, a plurality of face images can be acquired from the smart device. Training data is obtained according to the plurality of face images. The initial gaze point estimation model is then trained using the training data, thereby obtaining a gaze point estimation model that better fits the smart device. The present disclosure does not limit the way the training data is obtained.
For example, when the user triggers the line-of-sight interaction function of a mobile phone, the mobile phone shoots a photographed object (for example, the user) by calling a camera or another image capture module (for example, an infrared image sensor) in the mobile phone, and obtains a plurality of face images.
Alternatively, the plurality of face images acquired by the training device may be face images acquired when the user triggers the line-of-sight interaction function of a tablet, in which case the tablet shoots the photographed object by calling a camera or another image capture module in the tablet to acquire the plurality of face images.
Or, when the user triggers the line-of-sight interaction function of a smart screen, the smart screen shoots the photographed object by calling a camera or another image capture module in the smart screen to acquire the plurality of face images.
Or, when the user triggers the line-of-sight interaction function of a head-mounted smart device, the head-mounted smart device shoots the photographed object by calling a camera or another image capture module in the head-mounted smart device to acquire the plurality of face images.
In some examples, since the eye tracking technique may be applied to a variety of smart devices, the image types of the plurality of face images acquired by the training device may also include a variety of types; for example, the plurality of face images are RGB face images or IR face images. An RGB face image is an RGB image output by a camera. An IR face image is an infrared image output by an infrared image sensor.
In other examples, the plurality of face images includes a plurality of initial calibration face images, a plurality of initial calibration gaze points, a query face image, and a query gaze point. Each initial calibration fixation point corresponds to an initial calibration face image. The query gaze point corresponds to a query face image. The face image is initially calibrated and the face image is queried as the face image of the same user.
The plurality of initial calibration face images are face images captured by the smart device when the user uses the smart device at a plurality of different moments. The plurality of initial calibration face images may be captured at a plurality of different capture angles.
For example, the plurality of initial calibration face images include an initial calibration face image captured by the smart device when user A uses the smart device at time t1, which may be an image captured while user A looks at the top of the smart device; an initial calibration face image captured at time t2, which may be an image captured while user A looks at the middle of the smart device; an initial calibration face image captured at time t3, which may be an image captured while user A looks at the bottom of the smart device; … …; and an initial calibration face image captured at time t9, which may be an image captured while user A looks at the left side of the smart device.
The plurality of initial calibration gaze points are locations at which the eyes of the user of the smart device gaze at a plurality of different moments in time.
For example, at time t1, the position of the gaze of the eye of user a; at time t2, the position of the eye gaze of user a; at time t3, the position of the eye gaze of user a; … …; at time t9, the position of the user a's eye gaze, and so on.
In some scenarios, the plurality of initial calibration gaze points may be points on the mobile phone screen that user A is guided to look at when using the line-of-sight interaction function of the mobile phone. The plurality of initial calibration gaze points may appear in the mobile phone screen sequentially in a fixed order. Each time an initial calibration gaze point appears on the mobile phone screen, the mobile phone calls the camera to shoot an initial calibration face image and creates a correspondence between that initial calibration gaze point and the initial calibration face image. In this way, when the mobile phone screen presents the plurality of initial calibration gaze points in sequence, the mobile phone can call the camera to shoot the initial calibration face image corresponding to each of the plurality of initial calibration gaze points. The present disclosure does not limit the display form of the initial calibration gaze point on the mobile phone screen; it may be displayed as a dot, a starburst, or in another form.
The query gaze point is the position at which the user's eyes gaze at time t0. The query face image is the face image captured by the smart device when the user uses the smart device at time t0.
For example, the query face image Iq is an image captured at time t0, and the query gaze point Ĝq is the position at which the user's eyes gaze at time t0. The query face image Iq corresponds to the query gaze point Ĝq. The photographed object in the query face image (i.e., user A) is the same as the photographed object in the plurality of initial calibration face images (i.e., user A). The query gaze point and the calibration gaze points may be gaze points in two-dimensional coordinates or gaze points in three-dimensional coordinates.
Step 502, the training device screens the face images and outputs training data.
The training data comprises a query face image, a query gaze point, a plurality of calibration face images and a plurality of calibration gaze points. The plurality of calibration face images and the plurality of calibration gaze points are in one-to-one correspondence.
After acquiring the plurality of face images, the training device may screen them to select the images that meet the requirements (i.e., the training data). The initial gaze point estimation model is then trained with the training data that meets the requirements, thereby improving the training efficiency of the initial gaze point estimation model.
In some examples, the training device screening the plurality of face images may be the training device screening a plurality of initial calibration face images of the plurality of face images to obtain a plurality of screened calibration face images. And then determining the plurality of calibration face images, a plurality of calibration gaze points corresponding to the plurality of calibration face images, the query face image and the query gaze point as training data.
Screening the plurality of initial calibration face images among the plurality of face images means that the training device judges the validity of each of the plurality of initial calibration face images in turn. If an initial calibration face image is valid, it is considered an image that meets the requirements; the initial calibration face image is output as a calibration face image, and the initial calibration gaze point corresponding to it is output as the calibration gaze point corresponding to that calibration face image. If an initial calibration face image is invalid, it is considered an image that does not meet the requirements; the initial calibration face image is discarded together with its corresponding initial calibration gaze point. After all the initial calibration face images have been judged, the plurality of calibration face images and the plurality of calibration gaze points in the training data are obtained. Judging the validity of an initial calibration face image includes judging whether the face is complete, judging whether the left eye and the right eye in the face are open, judging whether the left eye and the right eye in the face are occluded, and the like.
For example, after the training device screens the initial calibration face images among the plurality of face images, the obtained calibration face images are the calibration face images I1, I2, I3, I4, I5, I6, I7, I8 and I9, which correspond to the calibration gaze points G1, G2, G3, G4, G5, G6, G7, G8 and G9, respectively.
Finally, the calibration face image I1 and the calibration gaze point G1; the calibration face image I2 and the calibration gaze point G2; the calibration face image I3 and the calibration gaze point G3; the calibration face image I4 and the calibration gaze point G4; the calibration face image I5 and the calibration gaze point G5; the calibration face image I6 and the calibration gaze point G6; the calibration face image I7 and the calibration gaze point G7; the calibration face image I8 and the calibration gaze point G8; the calibration face image I9 and the calibration gaze point G9; and the query face image Iq and the query gaze point Ĝq are used as the training data.
In other examples, the training device determining whether an initial calibration face image is complete may be the training device identifying the plurality of initial calibration face images using a face recognition algorithm (e.g., the Eigenface method, local binary patterns (Local Binary Patterns, LBP), the Fisherfaces algorithm, etc.) to determine whether the initial calibration face image is complete.
The training device determining whether the left eye and the right eye in the initial calibration face image are open may be that the training device identifies a plurality of initial calibration face images by using an open-eye algorithm (for example, hough transform method, gray-scale integral projection method, gabor method, template matching method, etc.) to obtain whether the left eye and the right eye in the initial calibration face image are open.
The training device determining whether the left eye and the right eye in the face are occluded may be that the training device identifies a plurality of initial calibration face images by using an occlusion algorithm (e.g., Z-buffering, backward rendering (Backward rendering), occlusion surface removal (Backface occlusion), etc.) to obtain whether the left eye and the right eye in the initial calibration face images are occluded.
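A minimal screening sketch is given below, assuming the three validity checks above are exposed as boolean helpers; is_face_complete, are_eyes_open and are_eyes_unoccluded are hypothetical names standing in for whichever face recognition, open-eye and occlusion algorithms are actually chosen.

```python
def screen_calibration_samples(initial_images, initial_gaze_points,
                               is_face_complete, are_eyes_open, are_eyes_unoccluded):
    """Return only the (image, gaze point) pairs whose image passes all validity checks."""
    calib_images, calib_gaze_points = [], []
    for img, gaze in zip(initial_images, initial_gaze_points):
        if is_face_complete(img) and are_eyes_open(img) and are_eyes_unoccluded(img):
            calib_images.append(img)        # keep as a calibration face image
            calib_gaze_points.append(gaze)  # keep its calibration gaze point
        # otherwise both the image and its gaze point are discarded
    return calib_images, calib_gaze_points
```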
Step 503, the training device preprocesses the training data to obtain identification data corresponding to the training data.
The identification data comprises query face data and a plurality of groups of calibration face data.
The query face data includes a query face region image, a left eye image of a query face, a right eye image of a query face, and a query face mesh image. The query face grid image is used for representing the position and the size of the query face in the whole image.
Each of the plurality of calibration face images corresponds to a set of calibration face data. The set of calibration face data comprises a calibration face area image, a left eye image of a calibration face, a right eye image of the calibration face and a calibration face grid image. The calibration face grid image is used for representing the position and the size of the calibration face in the whole image.
After the training device obtains the training data, the query face image and the plurality of calibration face images in the training data may be preprocessed to obtain the query face data corresponding to the query face image and a set of calibration face data corresponding to each of the plurality of calibration face images. The predicted query gaze point can then be obtained using the query face data and the plurality of sets of calibration face data.
In some examples, the training device preprocessing the query face image in the training data includes: first, the training device performs face detection on the query face image to identify the face contour in the query face image, and obtains the query face region according to the face contour in the query face image. Then, the training device may perform eye detection on the query face image to obtain the left eye region of the query face and the right eye region of the query face. Next, the training device crops the query face region, the left eye region of the query face, and the right eye region of the query face from the query face image to obtain the query face region image, the left eye image of the query face, and the right eye image of the query face. Finally, the training device determines the query face grid image based on the query face region image and the query face image. Exemplarily, a left eye image of a query face, a right eye image of a query face, a query face region image, and a query face grid image are shown in fig. 6.
The training device determines the query face grid image based on the query face area image and the query face image, and the training device can obtain the query face grid image through the following steps: firstly, determining the position of a query face region image in a query face image, and then obtaining coordinates of four vertexes of the query face region image in the query face image according to the position of the query face region image in the query face image. And then determining the area of the query face region image based on the coordinates of the four vertexes. The size of the query mesh image may then be set according to the user's needs. Illustratively, the query grid image is 25 x 25 in size. And then, according to the size of the query grid image, the size of the query face image, the coordinates of four vertexes of the query face area image and the area of the query face area image, obtaining the position and the size of the query face area image in the grid image. And finally, generating a query face grid image according to the position and the size of the query face area image in the grid image and the query grid image. As shown in fig. 6, the gray grid portions in the query face grid image of fig. 6 are used to characterize the location and size of the query face region image in the overall image.
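As a rough illustration of the grid computation described above, the sketch below maps a face bounding box onto a 25 x 25 grid with NumPy; the function name and the box format (x, y, width, height in pixels) are assumptions for illustration and are not taken from the original disclosure.

```python
import numpy as np

def face_grid(face_box, image_size, grid_size=25):
    """face_box: (x, y, w, h) of the face region in pixels; image_size: (W, H) of the full image."""
    W, H = image_size
    x, y, w, h = face_box
    grid = np.zeros((grid_size, grid_size), dtype=np.float32)
    # Map the face rectangle from pixel coordinates to grid cells.
    c0 = int(np.floor(x / W * grid_size))
    r0 = int(np.floor(y / H * grid_size))
    c1 = int(np.ceil((x + w) / W * grid_size))
    r1 = int(np.ceil((y + h) / H * grid_size))
    grid[r0:r1, c0:c1] = 1.0  # cells covered by the face region are marked
    return grid
```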
The face detection described above may be implemented in a variety of ways, such as a template matching method, a face rule method, and facial feature point detection. The template matching method is mainly used for determining whether a face exists in an image and where it is located; its principle is to compare a predefined face template with each sub-region of the screened face image so as to determine the face region in the screened face image. The face rule method detects and recognizes a face based on a predefined rule set; it judges features of the face, such as shape, color and texture, through a series of rules, thereby realizing fast detection and recognition of the face. Facial feature point detection is mainly used for identifying facial feature key points (such as eyebrow key points, nose key points, left eye key points, right eye key points and mouth key points) in the screened face image, marking the face with the facial feature key points, and then calculating according to a specific proportional relation to obtain the face region.
It should be noted that the above face detection method may be used in combination in actual detection, or only one or more of the above face detection methods may be used. The above-mentioned face detection method is only an example given by the embodiments of the present disclosure, and the electronic device may also use other face detection methods to perform face detection, which is not limited by the present disclosure.
The eye detection described above may also be implemented in a variety of ways, such as a template matching method and a Hough transform method. In eye detection, the template matching method may compare a pre-trained standard eye template with each sub-region of the screened face image one by one and determine the sub-region with the highest matching degree as the eye region; the details are similar to the template matching method described above and are not repeated here. The Hough transform method is an algorithm based on a mathematical transformation; the Hough transform can find specific geometric shapes (such as straight lines and circles) in the screened face image. For eye detection, the Hough transform can detect the circular contour corresponding to the pupil in the screened face image, and the pupil position is then calculated based on the circular contour to obtain the eye region.
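For the Hough-transform branch of eye detection, a sketch using OpenCV's HoughCircles is shown below; the blur kernel and the Hough parameters are illustrative values chosen for this sketch, not values taken from the original disclosure.

```python
import cv2
import numpy as np

def detect_pupils(gray_face):
    """gray_face: single-channel face crop; returns a list of detected (x, y, radius) circles."""
    blurred = cv2.medianBlur(gray_face, 5)  # suppress noise before the Hough transform
    circles = cv2.HoughCircles(blurred, cv2.HOUGH_GRADIENT, dp=1,
                               minDist=gray_face.shape[1] // 4,
                               param1=100, param2=20, minRadius=3, maxRadius=30)
    if circles is None:
        return []
    return [tuple(c) for c in np.round(circles[0]).astype(int)]
```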
In other examples, the training device pre-processing the plurality of calibrated face images in the training data includes: firstly, the training equipment carries out face detection on each calibration face image in a plurality of calibration face images so as to identify the face contour in each calibration face image. And obtaining a calibrated face area corresponding to each calibrated face image according to the face outline in each calibrated face image. Then, the training device may perform eye detection on each of the calibration face images to obtain a left eye region of the calibration face and a right eye region of the calibration face corresponding to each of the calibration face images. Then, the training equipment cuts the calibrated face area, the left eye area and the right eye area of each calibrated face image respectively to obtain a calibrated face area image, a calibrated face left eye image and a calibrated face right eye image corresponding to each calibrated face image. Finally, the training equipment determines a calibration face grid image corresponding to each calibration face image based on the calibration face area image and the calibration face image corresponding to each calibration face image. And obtaining a group of calibration face data corresponding to each calibration face image based on the calibration face area image, the left eye image of the calibration face, the right eye image of the calibration face and the calibration face grid image.
The step of determining the calibration face grid image by the training device based on the calibration face area image and the calibration face image is similar to the step of determining the query face grid image by the training device based on the query face area image and the query face image, and is not repeated here.
Step 504, the training device performs feature extraction on the identification data to obtain training features.
The training features comprise query face features and calibration face features corresponding to each set of calibration face data in the plurality of sets of calibration face data.
After the training equipment obtains the identification data corresponding to the training data, the feature extraction can be carried out on the identification data to obtain training features. And then, carrying out regression processing on the training characteristics to obtain the predicted query gaze point corresponding to the query face image.
In some examples, the initial gaze point estimation model includes an initial feature extraction network and an initial regression network. The initial feature extraction network is used to perform feature extraction on the identification data to output the training features. The initial regression network is used to output the predicted gaze point of the query face image according to the training features.
As shown in fig. 7, the feature extraction of the identification data may be that the training device invokes an initial feature extraction network in the initial gaze point estimation model, and performs feature extraction on query face data (i.e., a query face region image, a left eye image of a query face, a right eye image of a query face, and a query face mesh image) and each set of calibration face data (i.e., a calibration face region image, a left eye image of a calibration face, a right eye image of a calibration face, and a calibration face mesh image) in the identification data by using the initial feature extraction network, so as to obtain query face features corresponding to the query face data and a plurality of calibration face features corresponding to the plurality of sets of calibration face data.
For example, the training device performs feature extraction on the query face data in the recognition data to obtain the query face feature Xq. The number of the calibrated face data in the identification data is 9. And respectively extracting the characteristics of the 9 groups of calibrated face data to obtain 9 calibrated face characteristics. The 9 calibration face features are respectively a calibration face feature X1, a calibration face feature X2, a calibration face feature X3, a calibration face feature X4, a calibration face feature X5, a calibration face feature X6, a calibration face feature X7, a calibration face feature X8 and a calibration face feature X9.
In other examples, the initial feature extraction network may be generated based on a classical deep learning neural network model. By way of example, the initial feature extraction network may be constructed based on basic network models such as multi-layer perceptrons (Multilayer Perceptron, MLP), convolutional neural networks (Convolutional Neural Network, CNN), and recurrent neural networks (Recurrent Neural Network, RNN).
MLP is a feed-forward artificial neural network model for mapping multiple data sets of an input onto a single data set of an output. MLP generally includes: an input layer, a plurality of fully connected layers, and an output layer, the input layer may include at least one input, and the output layer may include at least one output. The number of inputs of the input layer, the number of layers of the full connection layer and the number of outputs of the output layer may be determined according to the need.
A CNN generally includes an input layer, convolution layers (Convolution Layer), pooling layers (Pooling Layer), fully connected layers (Fully Connected Layer, FC), and an output layer. In general, the first layer of a CNN is the input layer and the last layer is the output layer. A convolution layer typically contains a number of feature planes, and each feature plane may be composed of a number of neurons arranged in a rectangle; the neurons of the same feature plane share weights, and the shared weights are the convolution kernels. A pooling layer typically follows a convolution layer; it can take features of very large dimension, cut them into several regions, and take their maximum or average values to obtain new features of smaller dimension. A fully connected layer can combine all local features into global features, which are used to calculate the final score of each class.
RNNs are a type of recurrent neural network that takes sequence data as input, recurses in the evolution direction of the sequence, and all nodes are chained.
It should be noted that the initial feature extraction network may be constructed based on a plurality of classical neural networks, or may be constructed using only one of the plurality of classical neural networks, which is not limited by the present disclosure. The initial feature extraction network may be constructed based on a CNN network, for example.
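As one possible realization of a CNN-based initial feature extraction network, the sketch below processes the four inputs of one set of face data (face region image, left-eye image, right-eye image and 25 x 25 face grid image) and fuses them into one feature vector; the layer sizes and the 128-dimensional output are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Sketch of a CNN-based feature extraction network (layer sizes are assumptions)."""
    def __init__(self, feat_dim=128):
        super().__init__()
        def conv_branch():
            return nn.Sequential(
                nn.Conv2d(3, 16, 5, stride=2, padding=2), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(4), nn.Flatten())  # -> 32 * 4 * 4 = 512 features
        self.face_net = conv_branch()       # face region image branch
        self.left_eye_net = conv_branch()   # left eye image branch
        self.right_eye_net = conv_branch()  # right eye image branch
        self.grid_net = nn.Sequential(nn.Flatten(), nn.Linear(25 * 25, 128), nn.ReLU())
        self.fuse = nn.Linear(512 * 3 + 128, feat_dim)

    def forward(self, face, left_eye, right_eye, grid):
        z = torch.cat([self.face_net(face), self.left_eye_net(left_eye),
                       self.right_eye_net(right_eye), self.grid_net(grid)], dim=1)
        return self.fuse(z)  # one face feature vector per input sample
```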
Step 505, the training device performs regression processing on the training features and the plurality of calibration gaze points to obtain a predicted query gaze point.
After the training device obtains the query face feature in the training features and the calibration face feature corresponding to each set of calibration face data in the sets of calibration face data, regression processing can be performed on the query face feature, the plurality of calibration face features and the plurality of calibration gaze points, thereby obtaining the predicted query gaze point.
In some examples, as shown in fig. 7, the regression processing of the query face feature, the plurality of calibration face features and the plurality of calibration gaze points to obtain the predicted query gaze point may be performed as follows: after the initial feature extraction network outputs the query face feature and the plurality of calibration face features, the query face feature, the plurality of calibration face features and the plurality of calibration gaze points collected by the training device are spliced and input into the initial regression network, and the initial regression network performs full connection processing on them to output the predicted query gaze point.
For example, in connection with step 504, after the initial feature extraction network outputs the query face feature Xq and the calibration face features X1, X2, X3, X4, X5, X6, X7, X8 and X9, the calibration face features X1 to X9 are spliced to obtain the spliced calibration face feature Xc. In connection with step 502, the calibration gaze points G1, G2, G3, G4, G5, G6, G7, G8 and G9 collected by the training device are then spliced to obtain the spliced calibration gaze point Gc. Finally, the query face feature Xq, the spliced calibration face feature Xc and the spliced calibration gaze point Gc are spliced and input into the initial regression network (e.g., a regressor), and the regressor performs full connection processing on the query face feature Xq, the spliced calibration face feature Xc and the spliced calibration gaze point Gc to output the predicted query gaze point Gq.
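A corresponding sketch of the initial regression network is given below: the query face feature Xq, the spliced calibration face features Xc and the spliced calibration gaze points Gc are concatenated and passed through fully connected layers to output the predicted query gaze point Gq; the hidden width of 256 and the two-dimensional gaze point are assumptions for this sketch.

```python
import torch
import torch.nn as nn

class RegressionHead(nn.Module):
    """Sketch of fully connected regression over the spliced Xq, Xc and Gc (dimensions are assumptions)."""
    def __init__(self, feat_dim=128, num_calib=9, gaze_dim=2):
        super().__init__()
        in_dim = feat_dim + num_calib * feat_dim + num_calib * gaze_dim
        self.mlp = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, gaze_dim))

    def forward(self, xq, calib_feats, calib_gazes):
        # xq: (B, feat_dim); calib_feats: (B, 9, feat_dim); calib_gazes: (B, 9, gaze_dim)
        xc = calib_feats.flatten(1)  # spliced calibration face feature Xc
        gc = calib_gazes.flatten(1)  # spliced calibration gaze point Gc
        return self.mlp(torch.cat([xq, xc, gc], dim=1))  # predicted query gaze point Gq
```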
Step 506, the training device determines a gaze loss value using the predicted query gaze point and the query gaze point.
Wherein the gaze loss value is used to characterize a difference between the predicted query gaze point and the query gaze point.
After deriving the predicted query gaze point and the query gaze point, gaze loss values corresponding to the predicted query gaze point and the query gaze point may be calculated from the gaze loss function.
In some examples, the gaze loss function contains a distance term and an angle term, where Lg denotes the gaze loss value, Gq denotes the predicted query gaze point, and Ĝq denotes the query gaze point.
The distance term of the gaze loss function constrains the distance between the predicted query gaze point and the query gaze point, and the angle term constrains the included angle between the predicted query gaze point and the query gaze point.
In this way, even when the distance between the predicted query gaze point and the query gaze point is small but the included angle between them is large, the deviation between the predicted query gaze point and the query gaze point can still be constrained by the angle term of the loss. Compared with constraining only the distance between the predicted query gaze point and the query gaze point, the gaze loss function set in the present disclosure imposes a stronger constraint. Training the initial gaze point estimation model with the gaze loss value generated by this gaze loss function therefore yields a gaze point estimation model with higher recognition accuracy, and the predicted gaze point output by the trained gaze point estimation model is closer to the real gaze point.
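The exact gaze loss expression is not reproduced here, so the sketch below only assumes the structure described above: a Euclidean-distance term plus an included-angle term between the predicted query gaze point Gq and the query gaze point Ĝq, with an assumed weighting factor.

```python
import torch
import torch.nn.functional as F

def gaze_loss(pred, target, angle_weight=1.0, eps=1e-8):
    """Assumed form: distance term plus included-angle term between predicted and true gaze points."""
    dist = torch.norm(pred - target, dim=-1)                     # distance term
    cos = F.cosine_similarity(pred, target, dim=-1, eps=eps)
    angle = torch.acos(cos.clamp(-1 + eps, 1 - eps))             # included angle in radians
    return (dist + angle_weight * angle).mean()
```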
Step 507, the training device iteratively updates the initial regression network in the initial gaze point estimation model using the gaze loss value to obtain a trained regression network.
As can be seen in connection with step 504, the initial gaze point estimation model includes an initial feature extraction network and an initial regression network. After the initial regression network outputs the predicted query gaze point, a gaze loss value may be determined based on the predicted query gaze point and the query gaze point. As shown in fig. 7, the weight parameters and bias parameters in the initial regression network are then iteratively updated with gaze loss values. For example, a preset condition that the gaze loss value satisfies may be set, and the preset condition may be that the gaze loss value is smaller than the target gaze loss value.
If the gaze loss value is greater than or equal to the target gaze loss value, the weight parameters and bias parameters of the initial regression network are adjusted, and the initial regression network is updated according to the adjusted weight parameters and bias parameters. The regression processing of the training features and the plurality of calibration gaze points is then repeated with the adjusted initial regression network, a new gaze loss value is calculated, and whether the new gaze loss value meets the preset condition is judged; the iteration is repeated until the new gaze loss value meets the preset condition, thereby obtaining the trained regression network. The gaze loss value is specifically calculated by inputting the predicted query gaze point and the query gaze point into the gaze loss function; the gaze loss value may also be calculated by other operations according to requirements, which are not enumerated here.
Step 508, the training device determines a contrast loss value using the training features.
The training features comprise a query face feature and a plurality of calibration face features. The contrast loss value is used to characterize differences between the query face feature and the plurality of calibration face features.
In combination with the foregoing, the calibration face features are obtained by performing feature extraction on the calibration face data in the identification data. The more sets of calibration face data there are, the more calibration face features are extracted.
In some examples, the process of determining contrast loss using training features may include: first, a plurality of calibration face features in training features are spliced to obtain spliced calibration features. For example, the calibration face feature X1, the calibration face feature X2, the calibration face feature X3, the calibration face feature X4, the calibration face feature X5, the calibration face feature X6, the calibration face feature X7, the calibration face feature X8, and the calibration face feature X9 in the plurality of calibration face features are spliced to obtain the spliced calibration face feature Xc. And then determining a comparison loss value according to the spliced and calibrated face feature Xc and the query face feature Xq. The contrast loss value Lx can be calculated by a contrast loss function.
The contrast loss function can be expressed as Lx = s · ||Xc − Xq||² + (1 − s) · (m − ||Xc − Xq||²)+.
Here Lx denotes the contrast loss value, the subscript + in (m − ||Xc − Xq||²)+ denotes taking the positive part (values below zero are set to zero), and m is 1.
s=1 is used to indicate that the distance between the calibration gaze point and the query gaze point is less than or equal to a preset distance.
s=0 is used to indicate that the distance between the calibration gaze point and the query gaze point is greater than a preset distance.
For example, the preset distance is 2 cm. That is, when the distance between the calibration gaze point and the query gaze point is less than or equal to 2 cm, the contrast loss value is calculated using ||Xc − Xq||²; when the distance between the calibration gaze point and the query gaze point is greater than 2 cm, the contrast loss value is calculated using (m − ||Xc − Xq||²)+. The preset distance may be set based on actual requirements, which the present disclosure does not limit.
As can be seen from the expression of the contrast loss function, the distance between the calibration gaze point and the query gaze point needs to be determined before the contrast loss value is determined; based on this distance, the appropriate branch of the contrast loss function is selected to calculate the contrast loss value. The initial feature extraction network is then iteratively trained with the contrast loss value, so that the trained feature extraction network has a better feature extraction effect.
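A direct sketch of the contrast loss as reconstructed above follows; s is the 0/1 indicator of whether the calibration gaze point and the query gaze point are within the preset distance, and m defaults to 1.

```python
import torch

def contrast_loss(xc, xq, s, m=1.0):
    """s = 1 when the two gaze points are within the preset distance (e.g. 2 cm), else 0."""
    d2 = torch.sum((xc - xq) ** 2, dim=-1)  # ||Xc - Xq||^2
    return (s * d2 + (1.0 - s) * torch.clamp(m - d2, min=0.0)).mean()
```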
Step 509, the training device iteratively updates the initial feature extraction network in the initial gaze point estimation model using the contrast loss value to obtain a trained feature extraction network.
As can be seen in connection with step 504, the initial gaze point estimation model includes an initial feature extraction network and an initial regression network. After the initial feature extraction network outputs a plurality of calibration face features and query face features, a comparison loss value can be determined according to the plurality of calibration face features and the query face features. As shown in fig. 7, the weight parameters and bias parameters in the initial feature extraction network in the initial gaze point estimation model are then iteratively updated with the contrast loss values. For example, a preset condition that the contrast loss value satisfies may be set, and the preset condition may be that the contrast loss value is smaller than the target contrast loss value.
And if the contrast loss value is greater than or equal to the target contrast loss value, adjusting the weight parameters and the bias parameters of the initial feature extraction network. And updating the initial feature extraction network according to the adjusted weight parameters and the bias parameters. And repeating the feature extraction of the query face data and the plurality of groups of calibration face data by using the adjusted initial feature extraction network, further calculating a new contrast loss value, judging whether the new contrast loss value meets the preset condition, and repeating iteration until the new contrast loss value meets the preset condition, thereby obtaining the trained feature extraction network. The contrast loss value is specifically calculated by inputting the query face features and the plurality of calibration face features into a contrast loss function. The calculation of the contrast loss value may also be performed by other operations according to the requirements, which are not illustrated herein.
After the initial regression network is iteratively updated by using the gaze loss values, a trained regression network may be obtained. After the initial feature extraction network is iteratively updated by using the contrast loss values, a trained feature extraction network can be obtained. The trained gaze point estimation model includes a trained regression network and a trained feature extraction network. The trained gaze point estimation model may then be utilized to generate a predicted gaze point.
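The sketch below assembles the pieces above into one training iteration, keeping the separation described in steps 507 and 509 (the gaze loss drives the regression network, the contrast loss drives the feature extraction network); the optimizers, learning rates, batch layout and the per-calibration-sample form of the contrast loss are assumptions made for this sketch.

```python
import torch

# Assumes FeatureExtractor, RegressionHead, gaze_loss and contrast_loss from the sketches above.
feat_net, reg_net = FeatureExtractor(), RegressionHead()
opt_feat = torch.optim.Adam(feat_net.parameters(), lr=1e-4)
opt_reg = torch.optim.Adam(reg_net.parameters(), lr=1e-4)

def train_step(batch):
    xq = feat_net(*batch["query_inputs"])  # query face feature Xq
    calib_feats = torch.stack([feat_net(*inp) for inp in batch["calib_inputs"]], dim=1)
    pred = reg_net(xq, calib_feats, batch["calib_gazes"])  # predicted query gaze point Gq

    lg = gaze_loss(pred, batch["query_gaze"])  # gaze loss (step 506)
    # Contrast loss computed per calibration sample (an assumption); batch["s"] is the (B, 9)
    # indicator derived from the calibration/query gaze-point distances.
    lx = contrast_loss(calib_feats, xq.unsqueeze(1).expand_as(calib_feats), batch["s"])

    opt_reg.zero_grad(); opt_feat.zero_grad()
    lg.backward(retain_graph=True)  # gaze loss updates the regression network (step 507)
    opt_reg.step()
    opt_feat.zero_grad()            # discard gaze-loss gradients that reached the extractor
    lx.backward()                   # contrast loss updates the feature extraction network (step 509)
    opt_feat.step()
    return lg.item(), lx.item()
```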
Flow (2) (model application phase) is described below.
Step 510, the training device sends the trained gaze point estimation model to the terminal device.
After the training device obtains the trained gaze point estimation model using the training data, it may send the trained gaze point estimation model to the terminal device, so that when the setting application of the terminal device realizes the line-of-sight interaction function, the estimated gaze point can be obtained by calling the trained gaze point estimation model, and the terminal device can then execute the corresponding function instruction based on the estimated gaze point. Illustratively, the function instructions include, but are not limited to, opening an application, opening a chat dialog, capturing an image, playing a video, and so forth.
Step 511, the terminal device receives the trained gaze point estimation model.
The setting application of the terminal device receives the trained gaze point estimation model and stores the trained gaze point estimation model in an algorithm library in the system library in fig. 4.
Step 512, the terminal device detects the line-of-sight interaction operation of the user.
The line-of-sight interaction operation is used for executing a function instruction corresponding to the line-of-sight interaction.
For example, the line-of-sight interaction may be the detection by the handset that the user views the display interface of the handset in an attempt to trigger a corresponding functional instruction based on the location viewed by the user.
In some examples, the user needs to trigger the setup application of the handset to turn on the gaze interaction function before the user uses the gaze interaction function of the handset. When the sight line interaction function is started, the setting application program of the mobile phone can trigger the camera of the mobile phone to be started. After the camera is started, if the sight line interaction operation of the user is detected, the camera can be utilized to capture the sight line interaction image corresponding to the sight line interaction operation of the user. And then, determining the estimated gaze point according to the sight line interaction image, and finally executing corresponding functional instructions based on the estimated gaze point. For example, the user opening the gaze interaction function may be achieved by a start operation. The starting operation is any one of a single click operation, a double click operation, a knuckle click and a multi-finger selection operation.
In step 513, in response to the line-of-sight interaction operation, the setting application of the mobile phone sends a shooting notification to the camera of the mobile phone.
Wherein the photographing notification is used to request the camera to capture a current line-of-sight interaction image.
After the mobile phone detects the sight line interaction operation of the user, the mobile phone setting application sends a shooting notice to the camera of the mobile phone as a response to instruct to start the camera of the mobile phone, so that the camera of the mobile phone captures a sight line interaction image corresponding to the sight line interaction operation.
Step 514, the camera of the mobile phone receives the shooting notification.
Step 515, in response to the photographing notification, the camera of the mobile phone acquires the line-of-sight interaction image and sends the line-of-sight interaction image to the setting application.
After the camera of the mobile phone receives the shooting notification, the camera can respond to the shooting notification to acquire the sight interaction image. And then the camera of the mobile phone can send the sight-line interaction image to the setting application, so that the setting application program of the mobile phone can determine the estimated gaze point of the user based on the sight-line interaction image, and further execute the function instruction corresponding to the estimated gaze point.
Step 516, the setting application receives the line-of-sight interaction image.
Step 517, the setting application determines the estimated gaze point corresponding to the line-of-sight interaction image using the trained gaze point estimation model, and executes the corresponding function instruction based on the estimated gaze point.
And after the setting application receives the sight line interaction image, calling a trained gaze point estimation model in the algorithm library. And determining the estimated gaze point corresponding to the sight-line interaction image by using the trained gaze point estimation model. And finally, based on the position of the estimated gaze point, calling a corresponding functional control to execute a corresponding functional instruction.
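On the terminal-device side, applying the trained model can be sketched as below, reusing the feat_net/reg_net naming from the training sketches; preprocess is a hypothetical helper that reproduces the step-503-style cropping, and the calibration features and calibration gaze points captured on the device are assumed to be cached.

```python
import torch

def estimate_gaze(feat_net, reg_net, interaction_image, calib_feats, calib_gazes, preprocess):
    """Return the estimated gaze point for one line-of-sight interaction image."""
    feat_net.eval(); reg_net.eval()
    with torch.no_grad():
        face, left_eye, right_eye, grid = preprocess(interaction_image)  # step-503-style crops
        xq = feat_net(face, left_eye, right_eye, grid)                   # query face feature
        return reg_net(xq, calib_feats, calib_gazes)                     # estimated gaze point
```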
Illustratively, as shown in (a) of fig. 8, in response to an unlocking operation by a user, the mobile phone enters a display interface 800, and the display interface 800 (i.e., a desktop) includes an area 801, an area 802, and an area 803. Wherein, the area 801 is used for displaying conventional prompt data, such as: time, weather, address, etc. (i.e., 08:00,1 month No. 1, thursday, sunny, XX zone). Region 802 is used to display application icons, such as: video icons, run health icons, weather icons, browser icons, radio icons, setup icons, recorder icons, application mall icons, and the like. The area 803 is used to display an application icon fixed to the bottom bar of the mobile phone (when the display interface is switched, the application icon is not changed), for example: camera icons, address book icons, phone icons, and information icons.
At this time, the WeChat application receives a new WeChat message, and the handset displays a prompt 804 as shown in FIG. 8 (b). The prompt 804 is used to prompt the user that the WeChat application has just received a WeChat message from WeChat A. In the case that the user has started the gaze interaction function of the mobile phone, in response to the gaze interaction operation of the user, the mobile phone may determine that the current gaze point of the user falls at the prompt message 804 by using the gaze point estimation model after training. As shown in fig. 8 (c), the handset displays a line-of-sight interaction control 805 at prompt 804. After the line-of-sight interaction control 805 is displayed for 2s, the cell phone displays an interface 806 as shown in (d) of fig. 8. Interface 806 is a chat interface of the user with contact small a.
In addition, although the above embodiment uses the application of the trained gaze point estimation model to a mobile phone as an example scenario, it may be appreciated that the trained gaze point estimation model may also be applied to other smart devices with line-of-sight interaction functions, such as vehicle-mounted devices, tablets, watches, smartphones, smart screens, head-mounted smart devices (e.g., AR/VR glasses), etc., which the embodiments of the present disclosure do not limit.
After the training device obtains the query face feature corresponding to the query face image and the plurality of calibration face features corresponding to the plurality of calibration face images by using the initial feature extraction network in the initial gaze point estimation model, it performs regression processing on the query face feature, the plurality of calibration face features and the plurality of calibration gaze points by using the initial regression network. That is, the initial regression network introduces the plurality of calibration gaze points (i.e., real gaze points) when performing the regression processing, which can be used to correct the regression direction of the initial regression network, so that the predicted gaze point output by the initial regression network is more accurate. Finally, the initial gaze point estimation model is iteratively updated using the gaze loss value and the contrast loss value, so that the trained gaze point estimation model has higher recognition accuracy.
In addition, the calculation of the gaze loss value takes into account not only the distance between the predicted query gaze point and the query gaze point but also the included angle between them. Therefore, even when the distance between the predicted query gaze point and the query gaze point is small but the included angle between them is large, the deviation between the two can still be constrained by the angle term of the loss. Compared with constraining only the distance between the predicted query gaze point and the query gaze point, the gaze loss function set in the present disclosure imposes a stronger constraint. Iteratively updating the initial gaze point estimation model based on the gaze loss value therefore yields a gaze point estimation model with higher recognition accuracy, and the predicted gaze point output by the trained gaze point estimation model is closer to the real gaze point.
The calibration method provided by the embodiment of the present disclosure is described below with reference to fig. 9. As shown in fig. 9, the calibration method may include the following steps 901-905.
Step 901, acquiring a sample data set, wherein the sample data set comprises a first face image, a first gaze point corresponding to the first face image, a plurality of second face images and a second gaze point corresponding to each second face image; the first face image and the second face image are face images of the same user.
The first face image may be the face image to be processed by the initial gaze point estimation model and may also be referred to as the query face image in the embodiment shown in fig. 5. The first gaze point may also be referred to as the query gaze point in the embodiment shown in fig. 5. The second face image is a baseline reference image and may also be referred to as the calibration face image in the embodiment shown in fig. 5. The second gaze point may also be referred to as the calibration gaze point in the embodiment shown in fig. 5. For details, see steps 501-502, which are not repeated here.
Step 902, respectively extracting features of the first face image and the plurality of second face images based on an initial feature extraction network in the initial gaze point estimation model to obtain a first face image feature and a plurality of second face image features.
The first face image feature may also be referred to as the query face feature in the embodiment shown in fig. 5. The second face image feature may also be referred to as the calibration face feature in the embodiment shown in fig. 5. Details refer to step 503 and step 504 and are not repeated here.
Step 903, carrying out regression processing on the first face image feature, the plurality of second face image features and the plurality of second gaze points based on an initial regression network in the initial gaze point estimation model to obtain a predicted gaze point corresponding to the first face image.
The predicted gaze point may also be referred to as the predicted query gaze point in the embodiment shown in fig. 5 described above. Details refer to step 505 and are not described here.
Step 904, calculating a first loss value between the predicted gaze point and the first gaze point, and a second loss value between the first face image feature and the plurality of second face image features.
The first loss value may also be referred to as the gaze loss value in the embodiment shown in fig. 5. Details refer to step 506, which is not described here. The second loss value may also be referred to as the comparative loss value in the embodiment shown in fig. 5 described above. Details refer to step 508, which is not described here again.
Step 905, iteratively updating the initial gaze point estimation model according to the first loss value and the second loss value to obtain a trained gaze point estimation model.
Details refer to step 507 and step 509, which are not described here again.
In some examples, performing feature extraction on the first face image and the plurality of second face images respectively based on the initial feature extraction network in the initial gaze point estimation model to obtain the first face image feature and the plurality of second face image features includes: preprocessing the first face image to obtain first identification data corresponding to the first face image, where the first identification data includes a left eye image of the first face, a right eye image of the first face, a first face region image and a first face mesh image; preprocessing the plurality of second face images to obtain second identification data corresponding to each of the plurality of second face images, where the second identification data includes a left eye image of the second face, a right eye image of the second face, a second face region image and a second face mesh image; performing feature extraction on the first identification data using the initial feature extraction network to obtain the first face image feature; and performing feature extraction on the plurality of pieces of second identification data using the initial feature extraction network to obtain the plurality of second face image features.
The first recognition data may also be referred to as query face data in the embodiment shown in fig. 5, the left-eye image of the first face may also be referred to as a left-eye image of the query face in the embodiment shown in fig. 5, the right-eye image of the first face may also be referred to as a right-eye image of the query face in the embodiment shown in fig. 5, the first face region image may also be referred to as a query face region image in the embodiment shown in fig. 5, and the first face mesh image may also be referred to as a query face mesh image in the embodiment shown in fig. 5. The second recognition data may also be referred to as the calibration face data in the embodiment shown in fig. 5, the left-eye image of the second face may also be referred to as the left-eye image of the calibration face in the embodiment shown in fig. 5, the right-eye image of the second face may also be referred to as the right-eye image of the calibration face in the embodiment shown in fig. 5, the second face region image may also be referred to as the calibration face region image in the embodiment shown in fig. 5, and the second face mesh image may also be referred to as the calibration face mesh image in the embodiment shown in fig. 5. Details refer to step 503 and step 504, and are not described here again.
In some examples, performing regression processing on the first face image feature, the plurality of second face image features, and the plurality of second gaze points based on an initial regression network in the initial gaze point estimation model to obtain a predicted gaze point corresponding to the first face image, including: splicing the first face image features, the plurality of second face image features and the plurality of second fixation points to obtain spliced data; and carrying out regression processing on the spliced data based on the initial regression network to obtain the predicted fixation point.
Details refer to step 505 and are not described here.
In some examples, iteratively updating the initial gaze point estimation model based on the first loss value and the second loss value to obtain a trained gaze point estimation model includes: iteratively updating the initial regression network according to the first loss value to obtain a trained regression network; iteratively updating the initial feature extraction network according to the second loss value to obtain a trained feature extraction network; the trained gaze point estimation model comprises a trained regression network and a trained feature extraction network.
Details refer to step 507 and step 509, which are not described here again.
In some examples, the first loss value contains a distance term and an angle term between the predicted gaze point and the first gaze point, where Lg denotes the first loss value, Gq denotes the predicted gaze point, and Ĝq denotes the first gaze point. Details refer to step 506 and are not repeated here.
In some examples, the second loss value satisfies the following relationship: Lx = s · ||Xc − Xq||² + (1 − s) · (m − ||Xc − Xq||²)+.
Here Lx denotes the second loss value, Xc denotes the plurality of second face image features, Xq denotes the first face image feature, m = 1, the subscript + denotes taking the positive part, s = 1 indicates that the distance between the first gaze point and the second gaze point is less than or equal to the preset distance, and s = 0 indicates that the distance between the first gaze point and the second gaze point is greater than the preset distance. Details refer to step 508 and are not repeated here.
In some examples, the preset distance is 2cm.
In some examples, the method further comprises: and sending the trained gaze point estimation model to the terminal equipment so that the terminal equipment can apply the trained gaze point estimation model to output the estimated gaze point corresponding to the to-be-processed sight line interaction image.
The line-of-sight interaction image to be processed may also be referred to as the line-of-sight interaction image in the embodiment shown in fig. 5, and the details refer to steps 510-517, which are not repeated here.
Corresponding to the method in the foregoing embodiment, the embodiment of the present disclosure further provides a calibration device. The calibration device can be applied to an electronic apparatus for implementing the method in the foregoing embodiment. The function of the calibration device can be realized by hardware, and can also be realized by executing corresponding software by hardware. The hardware or software includes one or more modules corresponding to the functions described above.
For example, fig. 10 shows a schematic structural diagram of a calibration device 1000, and as shown in fig. 10, the calibration device 1000 may include: an acquisition module 1001, a feature extraction module 1002, a processing module 1003, a determination module 1004, an update module 1005, and the like.
An obtaining module 1001 configured to obtain a sample data set, where the sample data set includes a first face image, a first gaze point corresponding to the first face image, a plurality of second face images, and a second gaze point corresponding to each of the second face images; the first face image and the second face image are face images of the same user;
the feature extraction module 1002 is configured to perform feature extraction on the first face image and the plurality of second face images based on an initial feature extraction network in the initial gaze point estimation model, so as to obtain a first face image feature and a plurality of second face image features;
the processing module 1003 is configured to perform regression processing on the first face image feature, the plurality of second face image features and the plurality of second gaze points based on an initial regression network in the initial gaze point estimation model, so as to obtain a predicted gaze point corresponding to the first face image;
a determining module 1004 configured to calculate a first loss value between the predicted gaze point and the first gaze point, and a second loss value between the first face image feature and the plurality of second face image feature predictions;
An updating module 1005 is configured to iteratively update the initial gaze point estimation model according to the first loss value and the second loss value, resulting in a trained gaze point estimation model.
In a possible implementation manner, the feature extraction module 1002 is further configured to preprocess the first face image to obtain first identification data corresponding to the first face image, where the first identification data includes a left eye image of the first face, a right eye image of the first face, a first face region image and a first face mesh image; preprocess the plurality of second face images to obtain second identification data corresponding to each of the plurality of second face images, where the second identification data includes a left eye image of the second face, a right eye image of the second face, a second face region image and a second face mesh image; perform feature extraction on the first identification data using the initial feature extraction network to obtain the first face image feature; and perform feature extraction on the plurality of pieces of second identification data using the initial feature extraction network to obtain the plurality of second face image features.
In a possible implementation, the processing module 1003 is further configured to splice the first face image feature, the plurality of second face image features, and the plurality of second gaze points to obtain spliced data; and perform regression processing on the spliced data based on the initial regression network to obtain the predicted gaze point.
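A minimal sketch of this splice-then-regress step, assuming the features and the second gaze points are flattened and concatenated into one vector before a small MLP regresses a 2-D gaze point; the layer sizes and the MLP itself are illustrative, not taken from the disclosure.

```python
import torch
import torch.nn as nn

class RegressionHead(nn.Module):
    """Illustrative regression network: splices the first face image feature,
    the second face image features, and the second gaze points, then regresses
    a predicted gaze point (x, y)."""

    def __init__(self, feat_dim, num_refs):
        super().__init__()
        in_dim = feat_dim + num_refs * (feat_dim + 2)
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, 2),
        )

    def forward(self, feat_q, feats_c, gazes_c):
        # feat_q: (B, D); feats_c: (B, N, D); gazes_c: (B, N, 2)
        spliced = torch.cat([feat_q, feats_c.flatten(1), gazes_c.flatten(1)], dim=1)
        return self.mlp(spliced)
```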
In one possible implementation, the updating module 1005 is further configured to iteratively update the initial regression network according to the first loss value, resulting in a trained regression network; iteratively updating the initial feature extraction network according to the second loss value to obtain a trained feature extraction network; the trained gaze point estimation model comprises a trained regression network and a trained feature extraction network.
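The decoupled updates can be sketched with two optimizers, one driven by the first loss value and one by the second; the optimizer choice and call order below are assumptions about one plausible realization.

```python
def update_step(loss_g, loss_x, opt_reg, opt_feat):
    """One iteration of module 1005 (sketch only): the first loss value updates
    the regression network, the second loss value updates the feature
    extraction network."""
    opt_reg.zero_grad()
    loss_g.backward(retain_graph=True)   # gradients for the regression network
    opt_reg.step()

    opt_feat.zero_grad()                 # discard feature-net gradients from loss_g
    loss_x.backward()                    # gradients for the feature extraction network
    opt_feat.step()
```

Here opt_reg would be constructed over the regression network's parameters only and opt_feat over the feature extraction network's parameters only, so each loss value drives exactly one sub-network, matching the description above.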
In one possible implementation, the first loss value satisfies the following relationship:
wherein Lg represents the first loss value, Gq represents the predicted gaze point, and Ĝq represents the first gaze point.
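The formula itself appears only as an image in the original publication and is not reproduced here. A common choice consistent with the notation above would be a squared-error term between the predicted and the first (ground-truth) gaze point, for example (assumption, not quoted from the disclosure):

```latex
% Assumed form only -- the original formula is an image and is not reproduced here
L_g = \left\lVert G_q - \hat{G}_q \right\rVert_2^{2}
```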
In one possible implementation, the second loss value satisfies the following relationship:
wherein Lx represents the second loss value, Xc represents the plurality of second face image features, Xq represents the first face image feature, m=1, s=1 indicates that the distance between the first gaze point and the second gaze point is less than or equal to a preset distance, and s=0 indicates that the distance between the first gaze point and the second gaze point is greater than the preset distance.
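Again the exact expression is an image in the original and is not reproduced here. The description of s and m is consistent with a contrastive-style loss over feature distances, in which m acts as a margin; the following form is an assumption only:

```latex
% Assumed contrastive-style form of the second loss (not reproduced from the original figure)
L_x = \frac{1}{N}\sum_{c=1}^{N}\Big[\, s\,\lVert X_q - X_c \rVert_2^{2}
      \;+\; (1-s)\,\max\!\big(0,\; m - \lVert X_q - X_c \rVert_2 \big)^{2} \Big]
```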
In one possible implementation, the preset distance is 2cm.
In one possible implementation, the calibration device 1000 may further include a sending module 1006. And the sending module 1006 is configured to send the trained gaze point estimation model to the terminal device, so that the terminal device outputs the estimated gaze point corresponding to the to-be-processed line-of-sight interaction image by applying the trained gaze point estimation model.
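On the terminal device side, applying the received model could look like the following sketch; the model-loading call, the model's input signature, and the reuse of the preprocess_face helper sketched earlier are all assumptions.

```python
import numpy as np
import torch

def estimate_gaze(model, interaction_image, boxes):
    """Hypothetical on-device use of the trained gaze point estimation model
    on a line-of-sight interaction image to be processed (sketch only)."""
    crops = preprocess_face(interaction_image, boxes)  # helper sketched earlier
    # Convert HxW(xC) numpy crops into (1, C, H, W) float tensors.
    tensors = [
        torch.from_numpy(np.atleast_3d(c)).float().permute(2, 0, 1).unsqueeze(0)
        for c in crops
    ]
    model.eval()
    with torch.no_grad():
        gaze_xy = model(*tensors)   # estimated gaze point, e.g. screen (x, y)
    return gaze_xy
```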
It should be understood that the division of units or modules (hereinafter referred to as units) in the above apparatus is merely a division of logical functions; in actual implementation, the units may be fully or partially integrated into one physical entity or may be physically separate. The units in the apparatus may all be implemented in the form of software invoked by a processing element, or all in hardware; alternatively, some units may be implemented as software invoked by a processing element while others are implemented in hardware.
For example, each unit may be a separately arranged processing element, may be integrated into a chip of the apparatus, or may be stored in a memory in the form of a program whose function is invoked and executed by a processing element of the apparatus. Furthermore, all or some of these units may be integrated together or implemented independently. The processing element herein may also be referred to as a processor and may be an integrated circuit with signal processing capability. During implementation, the steps of the above method or the above units may be implemented by hardware integrated logic circuits in the processor element, or in the form of software invoked by the processing element.
In one example, the units in the above apparatus may be one or more integrated circuits configured to implement the above method, for example: one or more ASICs, or one or more DSPs, or one or more FPGAs, or a combination of at least two of these integrated circuit forms.
For another example, when a unit in the apparatus is implemented in the form of a processing element scheduling a program, the processing element may be a general-purpose processor, such as a CPU or another processor capable of invoking a program. For yet another example, the units may be integrated together and implemented in the form of a system-on-chip (SoC).
In one implementation, the above apparatus for implementing each corresponding step in the above method may be implemented in the form of a processing element scheduling a program. For example, the apparatus may include a processing element and a storage element, and the processing element invokes a program stored in the storage element to perform the method of the above method embodiments. The storage element may be a storage element on the same chip as the processing element, that is, an on-chip storage element.
In another implementation, the program for performing the above method may be in a storage element located on a different chip from the processing element, that is, an off-chip storage element. In this case, the processing element invokes or loads the program from the off-chip storage element onto the on-chip storage element, and then invokes and executes the method of the above method embodiments.
For example, an embodiment of the present disclosure may further provide an apparatus, such as an electronic device, which may include a processor and a memory for storing instructions executable by the processor. The processor is configured to execute the above instructions, so that the electronic device implements the calibration method in the foregoing embodiments. The memory may be located inside or outside the electronic device, and there may be one or more processors.
In yet another implementation, the units implementing the steps of the above method may be configured as one or more processing elements, and these processing elements may be disposed on the electronic device described above. The processing elements may be integrated circuits, for example, one or more ASICs, one or more DSPs, one or more FPGAs, or a combination of these types of integrated circuits. These integrated circuits may be integrated together to form a chip.
For example, the embodiment of the present disclosure also provides a chip, which may be applied to the above-described electronic device. The chip includes one or more interface circuits and one or more processors; the interface circuit and the processor are interconnected through a circuit; the processor receives and executes computer instructions from the memory of the electronic device through the interface circuit to implement the methods of the above method embodiments.
Embodiments of the present disclosure also provide a computer readable storage medium having computer program instructions stored thereon. The computer program instructions, when executed by the electronic device, enable the electronic device to implement the calibration method as described above.
Embodiments of the present disclosure also provide a computer program product including computer instructions to be run in the electronic device described above; when the computer instructions are run in the electronic device, the electronic device is enabled to implement the calibration method described above.

From the foregoing description of the embodiments, it will be apparent to those skilled in the art that, for convenience and brevity of description, only the above division of functional modules is used as an example for illustration; in practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the apparatus may be divided into different functional modules to implement all or part of the functions described above.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another apparatus, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and the parts shown as units may be one physical unit or a plurality of physical units, may be located in one place, or may be distributed in a plurality of different places. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present disclosure may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
If the integrated units are implemented in the form of software functional units and sold or used as stand-alone products, they may be stored in a readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present disclosure may essentially be embodied in the form of a software product, for example, a program. The software product is stored in a program product, such as a computer-readable storage medium, and includes several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to perform all or part of the steps of the methods of the embodiments of the present disclosure. The aforementioned storage medium includes media capable of storing program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
For example, embodiments of the present disclosure may also provide a computer-readable storage medium having computer program instructions stored thereon. The computer program instructions, when executed by an electronic device, cause the electronic device to implement the calibration method as in the method embodiments described above.
The foregoing is merely a specific embodiment of the disclosure, but the protection scope of the disclosure is not limited thereto, and any changes or substitutions within the technical scope of the disclosure should be covered in the protection scope of the disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (10)

1. A method of calibration, the method comprising:
acquiring a sample data set, wherein the sample data set comprises a first face image, a first gaze point corresponding to the first face image, a plurality of second face images and a second gaze point corresponding to each second face image; the first face image and the second face image are face images of the same user;
respectively extracting features of the first face image and the plurality of second face images based on an initial feature extraction network in an initial gaze point estimation model to obtain a first face image feature and a plurality of second face image features;
performing regression processing on the first face image feature, the plurality of second face image features and the plurality of second gaze points based on an initial regression network in the initial gaze point estimation model to obtain a predicted gaze point corresponding to the first face image;
calculating a first loss value between the predicted gaze point and the first gaze point, and a second loss value between the first face image feature and the plurality of second face image features;
and iteratively updating the initial gaze point estimation model according to the first loss value and the second loss value to obtain a trained gaze point estimation model.
2. The method according to claim 1, wherein the feature extraction of the first face image and the plurality of second face images based on the initial feature extraction network in the initial gaze point estimation model to obtain a first face image feature and a plurality of second face image features, respectively, includes:
preprocessing the first face image to obtain first identification data corresponding to the first face image, wherein the first identification data comprises a left eye image of the first face, a right eye image of the first face, a first face area image and a first face grid image;
preprocessing the plurality of second face images to obtain second identification data corresponding to each of the plurality of second face images, wherein the second identification data comprises a left eye image of the second face, a right eye image of the second face, a second face region image and a second face grid image;
performing feature extraction on the first identification data by using the initial feature extraction network to obtain the first face image features;
and carrying out feature extraction on the plurality of second identification data by utilizing the initial feature extraction network to obtain the plurality of second face image features.
3. The method according to claim 1 or 2, wherein the performing regression processing on the first face image feature, the plurality of second face image features, and the plurality of second gaze points based on the initial regression network in the initial gaze point estimation model to obtain a predicted gaze point corresponding to the first face image includes:
splicing the first face image features, the plurality of second face image features and the plurality of second gaze points to obtain spliced data;
and carrying out regression processing on the spliced data based on the initial regression network to obtain the predicted gaze point.
4. A method according to any one of claims 1-3, characterized in that said iteratively updating said initial gaze point estimation model based on said first loss value and said second loss value, resulting in a trained gaze point estimation model, comprises:
iteratively updating the initial regression network according to the first loss value to obtain a trained regression network;
iteratively updating the initial feature extraction network according to the second loss value to obtain a trained feature extraction network; the trained gaze point estimation model comprises the trained regression network and the trained feature extraction network.
5. The method according to any one of claims 1-4, wherein the first loss value satisfies the relationship:
wherein Lg represents the first loss value, Gq represents the predicted gaze point, and Ĝq represents the first gaze point.
6. The method according to any one of claims 1-5, characterized in that the second loss value satisfies the following relation:
the Lx is used for representing a second loss value, the Xc is used for representing a plurality of second face image features, the Xq is used for representing a first face image feature, the m=1, the s=1 is used for representing that the distance between the first gazing point and the second gazing point is smaller than or equal to a preset distance, and the s=0 is used for representing that the distance between the first gazing point and the second gazing point is larger than the preset distance.
7. The method of claim 6, wherein the preset distance is 2 cm.
8. The method according to any one of claims 1-7, wherein the method further comprises:
and sending the trained gaze point estimation model to a terminal device, so that the terminal device applies the trained gaze point estimation model to output an estimated gaze point corresponding to a line-of-sight interaction image to be processed.
9. An electronic device, comprising:
a touch screen including a touch sensor and a display screen;
one or more processors;
a memory;
wherein the memory stores one or more computer programs, the one or more computer programs comprising instructions which, when executed by the electronic device, cause the electronic device to perform the calibration method according to any one of claims 1-8.
10. A computer-readable storage medium having computer program instructions stored thereon, characterized in that,
the computer program instructions, when executed by an electronic device, cause the electronic device to implement the calibration method according to any one of claims 1 to 8.
CN202310595864.1A 2023-05-24 2023-05-24 Calibration method and electronic equipment Pending CN117711040A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310595864.1A CN117711040A (en) 2023-05-24 2023-05-24 Calibration method and electronic equipment

Publications (1)

Publication Number Publication Date
CN117711040A true CN117711040A (en) 2024-03-15

Family

ID=90153984

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310595864.1A Pending CN117711040A (en) 2023-05-24 2023-05-24 Calibration method and electronic equipment

Country Status (1)

Country Link
CN (1) CN117711040A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021238373A1 (en) * 2020-05-26 2021-12-02 华为技术有限公司 Method for unlocking by means of gaze and electronic device
WO2021249053A1 (en) * 2020-06-12 2021-12-16 Oppo广东移动通信有限公司 Image processing method and related apparatus
CN115984934A (en) * 2023-01-04 2023-04-18 北京龙智数科科技服务有限公司 Training method of face pose estimation model, face pose estimation method and device
CN116030512A (en) * 2022-08-04 2023-04-28 荣耀终端有限公司 Gaze point detection method and device


Similar Documents

Publication Publication Date Title
US11880509B2 (en) Hand pose estimation from stereo cameras
US11263469B2 (en) Electronic device for processing image and method for controlling the same
US10846388B2 (en) Virtual reality environment-based identity authentication method and apparatus
CN111476306A (en) Object detection method, device, equipment and storage medium based on artificial intelligence
CN109804411A (en) System and method for positioning and mapping simultaneously
US20130328763A1 (en) Multiple sensor gesture recognition
KR102317820B1 (en) Method for processing image and electronic device supporting the same
KR102636243B1 (en) Method for processing image and electronic device thereof
KR20170088655A (en) Method for Outputting Augmented Reality and Electronic Device supporting the same
US11915400B2 (en) Location mapping for large scale augmented-reality
JP2022540549A (en) Systems and methods for distributing neural networks across multiple computing devices
CN113705302A (en) Training method and device for image generation model, computer equipment and storage medium
US10091436B2 (en) Electronic device for processing image and method for controlling the same
CN112818733B (en) Information processing method, device, storage medium and terminal
CN112527104A (en) Method, device and equipment for determining parameters and storage medium
CN112037305A (en) Method, device and storage medium for reconstructing tree-like organization in image
KR20180036359A (en) Method for displaying an image and an electronic device thereof
KR102348852B1 (en) Method for extracting objects and apparatus therefor
CN117711040A (en) Calibration method and electronic equipment
CN113762046A (en) Image recognition method, device, equipment and storage medium
KR20220115040A (en) Server operating fitness goods trading platform and methos for operating the same
CN114758334A (en) Object registration method and device
US20240073402A1 (en) Multi-perspective augmented reality experience
KR102284769B1 (en) Server for providing three-dimensional contents and method for operation thereof
US20230401796A1 (en) Fast ar device pairing using depth predictions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination